MapReduce without using MapReduce?
Recently I had the chance to work on a few distributed parallel computing solutions for data analytics. One of the interesting choices I've had to make is whether to use the MapReduce computing model (using Hadoop) or whether to come up with an adhoc way of doing it.
At first it wasn't a very easy decision to make the forces I was facing were:
- I haven't had experience working with Hadoop before, and the last thing I did with it is the sample program that comes with it. Although I can understand that it is easy to use, there are other operational issues to consider.
- Hadoop requires that you put the data into the DFS and while this is convenient to deal with in a program, operationally that's usually not an option. Imagine dumping data from a huge (production) database and then massaging it into a format that you can deal with through Hadoop.
- The MapReduce model works as a parallel computing model and is simple enough for the kinds of analytics I intend to do. However, implementing a framework from scratch seemed like a daunting challenge especially since there was already an existing solution in Hadoop.
I then proceeded to install OpenMPI which Boost.MPI had been tested against. At first I was skeptical whether it would be possible to use it in a data-parallel solution granted that it is a message passing implementation. Then I wrote the code and found out that it was definitely possible to do MapReduce with MPI.
The MapReduce Pattern
Basically there is a simple pattern behind the MapReduce computing model. There are two steps: the Map step and the Reduce step. In between you have shuffles, etc. but the gist is that you need a way to map computation to data, and the results of the individual computations are then reduced to yield the answer. It's basically a divide-and-conquer approach to dealing with data in parallel.
If you were doing this with MPI, you'd start with the 'scatter' function call. In Boost.MPI, you'd have some code like this:
vector<int,int> ranges;Once you have your ranges scattered, you then deal with the data you get in the individual nodes. Your ranges may represent bounds in the database, maybe a part of a file shared accross the cluster, or just a range of numbers. This is the easy part, now the challenge becomes the reduce part.
// populate ranges
pair<int,int> range;
scatter(world, ranges, range, 0);
With that you need to think a little bit about what kind of data structure you're going to be transmitting from the worker nodes to the root node -- and how to actually "merge" the results to yield what you need in the end. Once you're settled with that, you can then implement the merge (or the reduction step) as a function object. In the following example I implement a merge of maps of counts as a functor:
struct merge_maps {We then use it with the Boost.MPI 'reduce' implementation:
mapoperator() (map<int,int> l, map<int,int> const & r) {
l.insert(r.begin(), r.end());
return l;
}
};
namespace boost { namespace mpi {
template <>
struct is_commutative<merge_maps,map<int,int> > : true_ {};
}}
map<int,int> partial_map; // partial from the nodesAfter this step, you basically have the final results in the 'final_map'.
map<int,int> final_map; // the "merged" map
reduce(world, partial_map, final_map, merge_maps(), 0);
And that's it! You have an adhoc MapReduce implementation using Boost.MPI!
Of Course...
This isn't really MapReduce, but with all this talk about parallel computing and massively parallel distributed computing you can say that there are many ways of addressing the issue using alternative technologies.
It'd be nice to know how you deal with your parallel distributed computing needs.



1 comments:
In mapreduce you don't need to work with DFS (or GFS) data. In fact, probably the most efficient model to use it in conjunction with a distributed database like hbase or hypertable which are bigtable equivalents.
Your distributed database can then just be the backend for whatever website you use.
However, I'm pretty sure that people have done maps and reduces from plain old sql style databases. The thing to keep in mind there is that your IO may be limited by MySql, as you will have a great number of nodes running mappers pegging your single MySql instance.
Map reduces input and output can be more or less anything. Probably most example code are going to be based on using files on HDFS, but that's more or less the "hello world" of map reduce IO.
Post a Comment