Monday, June 08, 2009

Bloom Filters

During the weekend I was fascinated by a simple utility called a Bloom Filter. From the Wikipedia entry:

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
So I tried implementing my own generic Bloom Filter in C++ and submitted it to the Boost C++ Libraries. My email went a little something like this:

During the weekend, I got acquainted with and excited about Bloom
Filters and the utility of such a data structure. I then proceeded to
look at generic implementations of a bloom filter and I thought about
implementing one myself instead as a simple exercise.

Attached is what I came up with which I'm submitting as a library for
review (and uploading to the vault). Also attached is a sample program
which uses the bloom_filter.

This then led to a few responses, and now I'm proud to say that my implementation can now be checked out and tracked from Boost Sandbox! The URL to the subversion repository is:
https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/
Have fun with the library, and I would definitely like to know what you think about it!

Wednesday, May 27, 2009

MapReduce without using MapReduce?

Recently I had the chance to work on a few distributed parallel computing solutions for data analytics. One of the interesting choices I've had to make is whether to use the MapReduce computing model (using Hadoop) or whether to come up with an adhoc way of doing it.

At first it wasn't a very easy decision to make the forces I was facing were:

  1. I haven't had experience working with Hadoop before, and the last thing I did with it is the sample program that comes with it. Although I can understand that it is easy to use, there are other operational issues to consider.
  2. Hadoop requires that you put the data into the DFS and while this is convenient to deal with in a program, operationally that's usually not an option. Imagine dumping data from a huge (production) database and then massaging it into a format that you can deal with through Hadoop.
  3. The MapReduce model works as a parallel computing model and is simple enough for the kinds of analytics I intend to do. However, implementing a framework from scratch seemed like a daunting challenge especially since there was already an existing solution in Hadoop.
Given the above thought process, I started looking elsewhere. One of the first things I turned to was the Boost C++ Library and remembered that Boost.MPI is available to be used. Since I was already using Boost in my projects, it seemed a logical choice to stick with it.

I then proceeded to install OpenMPI which Boost.MPI had been tested against. At first I was skeptical whether it would be possible to use it in a data-parallel solution granted that it is a message passing implementation. Then I wrote the code and found out that it was definitely possible to do MapReduce with MPI.

The MapReduce Pattern

Basically there is a simple pattern behind the MapReduce computing model. There are two steps: the Map step and the Reduce step. In between you have shuffles, etc. but the gist is that you need a way to map computation to data, and the results of the individual computations are then reduced to yield the answer. It's basically a divide-and-conquer approach to dealing with data in parallel.

If you were doing this with MPI, you'd start with the 'scatter' function call. In Boost.MPI, you'd have some code like this:

vector<int,int> ranges;
// populate ranges
pair<int,int> range;
scatter(world, ranges, range, 0);

Once you have your ranges scattered, you then deal with the data you get in the individual nodes. Your ranges may represent bounds in the database, maybe a part of a file shared accross the cluster, or just a range of numbers. This is the easy part, now the challenge becomes the reduce part.

With that you need to think a little bit about what kind of data structure you're going to be transmitting from the worker nodes to the root node -- and how to actually "merge" the results to yield what you need in the end. Once you're settled with that, you can then implement the merge (or the reduction step) as a function object. In the following example I implement a merge of maps of counts as a functor:
struct merge_maps {
map operator() (map<int,int> l, map<int,int> const & r) {
l.insert(r.begin(), r.end());
return l;
}
};

namespace boost { namespace mpi {
template <>
struct is_commutative<merge_maps,map<int,int> > : true_ {};
}}
We then use it with the Boost.MPI 'reduce' implementation:
map<int,int> partial_map; // partial from the nodes
map<int,int> final_map; // the "merged" map
reduce(world, partial_map, final_map, merge_maps(), 0);
After this step, you basically have the final results in the 'final_map'.

And that's it! You have an adhoc MapReduce implementation using Boost.MPI!

Of Course...

This isn't really MapReduce, but with all this talk about parallel computing and massively parallel distributed computing you can say that there are many ways of addressing the issue using alternative technologies.

It'd be nice to know how you deal with your parallel distributed computing needs.

Friday, April 10, 2009

Function Input Iterator (Boost Submission)

A couple of weeks ago I've submitted an implementation of a "Function Input Iterator" that builds upon the implementation of the generator iterator adaptor. The function input iterator implementation uses a separate state object that tracks the state of the iterator. The state encapsulated by the iterator is used to compare whether two iterators are equal.


The library has been submitted as an additional library to the Boost.Iterators library and is part of the Trac ticket #2893. From the ticket submission, the following details are then quoted.

Simply put a the function input iterator implementation wraps a nullary function object and allows for bounded iterators using an embedded state variable. The use case is for something like the following:

struct generator {  typedef int result_type;  generator() { srand(time(0)); }  result_type operator() () const { return rand(); } };  using namespace std; using namespace boost;  int main(int argc, char * argv[]) {  generator f;  copy(    make_function_input_iterator(f, infinite()),    make_function_input_iterator(f, infinite()),    ostream_iterator(cout, " ")    );  return 0; } 

And if you don't want an infinite range, you can simply bound it:

copy(   make_function_input_iterator(f, 0),   make_function_input_iterator(f, 10),   ostream_iterator(cout, " ") );

Sunday, March 15, 2009

GCC C++0x Features Exploration

In the interest of exploration, I played around one Sunday afternoon with the C++0x features available from GCC 4.3.2 [1] -- to better see what I can use at work when I migrate the application I'm writing from the C++98 standard to the C++0x standard. To gauge the differences, I tried implementing a simple stream-fusion approach to simultaneously compute the the mean, minimum, and maximum in a collection of integers.

For a little background, let me try and describe what stream fusion is briefly. Simply:




Stream Fusion

Technique of removing multiple loops over a container and performing the required computations in a single pass. This technique is meant to minimize the number of loop iterations hopefully to improve the performance of an application. [2]

Stream fusion is a good way to improve performance in applications that deal with huge containers and that perform multiple passes on the container to do multiple things. Sometimes the technique is a little more tideous to apply than the naive solutions but when performance matters it's something you can try.

Let's look at a naive version of getting the mean, minimum, and maximum of
an array of integers:


 
int numbers[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int min = min_element(numbers, numbers + 10);
int max = max_element(numbers, numbers + 10);
double mean =
static_cast<double>(accumulate(numbers, numbers + 10))
/ 10.0
;

Here, we have three loops that compute the mean, minimum, and maximum from
the same container. From the way it looks, it's just fine that we have three loops since we just have 10 elements in the array.

But what if for example we were reading the integers from a file and we have millions of integers to deal with, then going over the file three times doesn't seem like a viable solution especially if performance matters. At that scale of the problem, you can count on performance being a big deal. It also doesn't make sense for us to keep everything in memory either because the data may not fit in memory or the data actually comes from an input
iterator bound to a stream.

Stream fusion then becomes more obvious because the reason it's called stream fusion is because usually the computations are associated with going through data that comes from a stream. Going "back" in a stream usually doesn't make sense or performance-wise is very costly, so we try and keep
the computation going as we consume items in the stream to save on space consumption and to get the performance enhancements.


Stream Fusion, C++98 Style


So first thing we try is to do stream fusion with the C++98 standard. We'll use the TR1 tuple implementation that comes with GCC and the standard algorithms as well as a special computation type we'll callmean_min_max. First, we'll think about how to simultaneously compute the mean, minimum, and maximum.

To do this, we'll have to be able to keep track of the following variables:




  • minimum so far : we'll call this MIN

  • maximum so far : we'll call this MAX

  • total so far : we'll call this ACC

  • count of elements dealt with so far : we'll call this COUNT



What we want to do is for each iteration of the loop, we'll look at one element and see if it's the minimum so far, maximum so far, add it to ACC, and increment COUNT. To do this in functional programming means we'll have to rely on a left-fold and make mean_min_max a binary function object taking a state variable and an element as the respective arguments to the function operator overload.

To make it as generic as possible, we implement mean_min_max as a polymorphic function object type. We come up with the C++98 compliant version of mean_min_max below:


 
struct mean_min_max {
template <class T, class U>
T operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return make_tuple(
// This is the running mean...
static_cast<double>(get<ACC>(state) + element)
/ static_cast<double>(get<COUNT>(state) + 1),

// This is the minimum so far
(get<COUNT>(state) == 0)?
element
: (get<MIN>(state) > element)?
element
: get<MIN>(state),

// This is the maximum so far
(get<COUNT>(state) == 0)?
element
: (get<MAX>(state) < element)?
element
: get<MAX>(state),

// This is the new value of ACC
get<ACC>(state)+element,

// This is the new value of the COUNT
get<COUNT>(state) + 1
);
}
};

We can then use this as the fourth argument to the call to accumulate while we use the tuple (0.0, 0, 0, 0uL, 0) as the initial state of the left fold:


 
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
make_tuple(mean, min, max, 0uL, 0),
mean_min_max()
);

If we notice the call to accumulate we'll see that there's nothing special about this construct. Sure, we've accomplished what we want to accomplish with just the C++98 standard. You might be wondering what the problem with the above implementation could be, and in the next section I'll detail what those issues are.


Stream Fusion, C++0x Style


Now in the above solution, we have quite a number of inefficiencies that aren't readily obvious. First of these are the inordinate number of copies we'll be doing when returning the tuple from mean_min_max. In C++98 we're stuck with copies when returning data from functions. Unless we use something like an auto_ptr<...> as a return and use the heap, we don't have move semantics.

But in C++0x, we now have support for move semantics as an addition to the language. We rely on two features: r-value references and the move library function. So first, let's talk a little about what r-value references are and why they're important.




R-Value References

R-value references are denoted by && right after a type. This is used to denote that only temporary values and literal constants can be referred to by an r-value reference. These r-value references actually take on the storage of the temporary value thereby not needing to allocate new storage for a new variable. It is then said that the temporary is 'moved' from the original scope to the new scope without relying on copies and additional temporaries. [3]



With a C++0x implementation of the mean_min_max we can change a few things that would allow the solution to minimize on copies:


 
struct mean_min_max {
template <class T, class U>
T && operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
};

Here we change the result of the function call operator to be an r-value reference of type T and that we should move the temporary created in the body to be returned into the storage of wherever the result is to be put. Instead of the compiler calling the copy constructor of the tuple, it then relies on the move constructor (or a constructor that takes a T && as
the sole argument instead of a T const &) to initialize the values of the appropriate members.

If you have enough iterations and are dealing with heavy objects, r-value reference returns and move semantics will save you a lot of unnecessary copies. The less copies you make, the better the performance you get.

Another chance where we can use this technique is at the call to accumulate where we move the initial temporary tuple instead of letting the compiler copy the tuple for us:


 
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(
make_tuple(mean, min, max, 0uL, 0)
),
mean_min_max()
);

GCC and C++0x Lambda's


Unfortunately, GCC 4.3.2 doesn't yet support C++0x Lambda's [4]. I haven't checked yet if GCC 4.4.x [5] already supports it, but since lambda's aren't supported by the compiler I was testing, I'm not yet able to show a solution to the problem without testing it. However, if your compiler already supports the C++0x lambda's, you can go ahead and move body of the mean_min_max function into a lambda like this:


 
double mean = 0.0;
int min = 0, max = 0;
typedef tuple<double, int, int, unsigned long, size_t> accumulator_tuple;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(make_tuple(mean, min, max, 0uL, 0)),
accumulator_tuple &&
[](accumulator_tuple & const state, int const & element) {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
);

If you're just making throw-away functions, lambda's are an attractive approach to keep the context of the computation closer to the point where you need the computation to be done. This is great if you are adept at using the STL algorithms and are comfortable with functional programming principles.

In this example the Lambda's noticeably don't have parametric types in the construction. This is because C++0x are monomorphic -- which means they only have one form. This makes C++0x lambda's a little less flexible compared to hand-coded function objects that have template member overloads of the function call operator (like our version).


More to Come


In this article I've highlighted important new features of C++0x which should allow developers to be more productive and be able to produce better performing code with R-value references, move semantics, and lambda's. There are a handful more new features upcoming at the language level which I would like to explore as well in upcoming articles.

All in all, C++0x allows you to write powerful and performant code that when combined with powerful idioms (like stream fusion) yields superior solutions. Even though the C++98 standard allows you to write good code, C++0x just allows you to write better code in more ways than one.







[1]GCC 4.3.2 http://gcc.gnu.org/gcc-4.3/






[2]Stream Fusion http://www.cse.unsw.edu.au/~dons/papers/CLS07.html






[3]R-value References http://www.open-std.org/JTC1/sc22/wg21/docs/papers/2004/n1690.html






[4]Lambda's http://en.wikipedia.org/wiki/C%2B%2B0x






[5]GCC 4.4.x http://gcc.gnu.org/gcc-4.4/changes.html

Monday, March 09, 2009

New Specification Library Release (0.2.1)

Recently, a few people have asked about my C++ implementation of a Behavior Driven Development library which was previously hosted at the Orange and Bronze website under my home directory. Apparently, there were a few changes made to the O&B site which disallowed my serving of the file releases from there so I thought about moving the file releases to my own domain. Having said that, there are a few things which have happened since the move, and I'm documenting those changes (and the new releases) in this blog entry.

If you haven't checked the site recently, I've added a dedicated file storage location for the Specification Library (which I've unimaginatively named as spec-cpp). From there you can get the latest releases of the library as well as get a sneak peak into what the library offers and supports.

The latest version line is the 0.2.x version which adds early container inspection support in the form of statements like:

container(some_container).should.contain(value);
container(some_container).should.not_contain(value);
container(some_container).should.not_be_empty();
container(some_container).should.have_size(value);

More releases along the 0.2.x line will concentrate on inspecting STL and STL-conformant containers. As of the moment I develop the library on my free time and release all the source code as open source using the Boost Software License.

This release has been tested against Boost 1.38.0 using the GCC 4.3.2 compiler in Linux.

If you'd like to contribute or just get the latest and greatest version of the code, please head on over to the github repository at http://github.com/mikhailberis/spec-c--.

This blog post is cross-posted in deanberris.com and blog.cplusplus-soup.com -- if you have questions or comments regarding the library, please send them to me@deanberris.com.

Update: Adding correct link.