Monday, June 08, 2009

Bloom Filters

During the weekend I was fascinated by a simple utility called a Bloom Filter. From the Wikipedia entry:

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
So I tried implementing my own generic Bloom Filter in C++ and submitted it to the Boost C++ Libraries. My email went a little something like this:

During the weekend, I got acquainted with and excited about Bloom
Filters and the utility of such a data structure. I then proceeded to
look at generic implementations of a bloom filter and I thought about
implementing one myself instead as a simple exercise.

Attached is what I came up with which I'm submitting as a library for
review (and uploading to the vault). Also attached is a sample program
which uses the bloom_filter.

This then led to a few responses, and now I'm proud to say that my implementation can now be checked out and tracked from Boost Sandbox! The URL to the subversion repository is:
https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/
Have fun with the library, and I would definitely like to know what you think about it!

7 comments:

Christopher Smith said...

I'm surprised that with all the things you templated on, you didn't template on the hash functions.

Tim Weiler said...

Could you post a small example of it's use?

Tim Weiler said...

Never mind, just noticed you put the sample in the attachment.

Dean Michael said...

Hi Chris, I'm actually in the process of still refining it so that instead of just the number of hash functions (K), you'd actually put a Boost.Fusion-compliant sequence as part of the template parameters. The sequence would then contain the types of compliant hash function object types. :)

Thanks for dropping a line!

Dean Michael said...

Tim, you can also try using the code in the test directory as well in the subversion repository as a guide. I'm in the process of refining the implementation, based on comments received from the Boost Developers mailing list. Hopefully I'll be able to come up with a high quality implementation soon. :)

Thanks for dropping a line Tim!

Paul said...

What about a Counting variant?
See wikipedia page
http://en.wikipedia.org/wiki/Bloom_filter

would it be easy to change a tag and get a counting-bloom-filter instead of a plain bloom filter?

Dean Michael said...

Hi Paul, that's one of the requests from the Boost developer community. At the moment I'm suffering from lack of available time for hacking on this implementation. Definitely tag-based dispatch would allow that to be possible -- and I'll need to do some major surgery (and re-reading) to make that happen.

The problem with such an approach is the implementation of an efficient segmented bitset (because you'd need to have more bits to contain the count). If you would notice I rely on Boost.Dynamic_bitset and the STL bitset. Implementing a segmented bitset may be a different problem in itself much like how to implement a compressed bitset (for a compressed Bloom Filter).

Thanks for dropping a line though, I'll try and hack on it when I get the time to do so.