Saturday, October 03, 2009

Elements of Programming: A Review (Part 3)

In Part 2 I showed the beginnings of a generic Range library implementation that approaches the problem by implementing a generic algorithm that deals with a Concept of a Range. In this third installment to the review of the book Elements of Programming by Alexander Stepanov and Paul McJones, I intend to present a logical thesis on how to approach application development at a higher level using generic programming techniques along with the strong mathematics and Concept-based approach that the book advocates.


Set-Up

So first, let's assume we're writing a simple application that takes a file input that contains a set of coordinates in a cartesian plane. These coordinates show a set of points connected to each other. This is a fully-connected graph.

Our application is supposed to read all the coordinates and then store it in memory when the application starts. First we don't specify what kind of data structure we use, just that we're going to put it in memory. Second, we don't assume how we store it in memory or whether we put it in a database of some sort -- just that we store it somewhere. Third, we don't know yet how each coordinate is stored but rather that we're just going to assume we're going to have it stored somehow.

At this point of the description, the application isn't really that interesting. However what it's supposed to do is what's interesting.

Ready for Input

Once the application has read all the coordinates from the file, the application is then ready for input. At this state, the application will take the input from the standard input device (or std::cin) in the form of three numbers separated by white space -- which signifies another cartesian coordinate, and a threshold distance value.

First, we assume that the numbers are within the range [-INTMAX,+INTMAX) where INTMAX is the largest possible value represented in an 'int'. Second, we assume that we can store the coordinate the same way we store a coordinate we read from the file. Third, the absolute value of the threshold is used -- therefore negative values become positive values.

So far so good, but then we get to the meat of the application.

The Problem/Solution

What we want to do is find all the coordinates that are within the threshold distance from the input coordinate and print to standard output the coordinates in the form "[x,y] d" where 'x' is the coordinate on the x axis, y is the coordinate on the y axis, and d is the distance to the input coordinate. If you draw a circle around the input coordinate at its center with radius of the threshold distance, then the coordinates that lie within the circle are those coordinates we're looking for.

If you can imagine it in your head, what we're going to have to do is:

  • Compute the distance between each coordinate and the source coordinate.
  • Filter out the coordinates that aren't within the threshold distance.
  • Display each coordinate in the prescribed format through standard output.

Understandably, we should be able to use already available STL containers and algorithms to do this. We can even model each coordinate as a std::pair<int,int> -- and then store them in a list.

In this first incarnation of the solution, we have a few distinct steps which are:

  1. Load the coordinates from the file.
  2. Get the distances and store it along with the pair into another list.
  3. Filter this list and remove the lists that aren't within a certain distance.
  4. Print the filtered list.

There is nothing wrong with this implementation per se -- however I see that we have three lists. We can optimize this implementation by removing the filtered list, and then hand-writing the iteration of the list of coordinates and distances.

Concepts

So in this implementation we're seeing that there is a certain concept called the coordinate. These are some basic operations defined against a Coordinate that should be valid (via ADL perhaps):

  • coord_x(c)
  • coord_y(c)
  • distance(c1, c2)

In these operations, we assume that 'c', 'c1', and 'c2' are instances of a type that models the Coordinate concept. This gives us a set of primitive operations which we can work with in our application. In this domain of coordinates, we can then count on having the distance between two coordinates and retrieving the x and y components of the coordinate.

Algorithms

Here we can then define the algorithms we need to implement. In a generic (and orthogonal) fashion we find that we have a few basic algorithms required to satisfy the end result that we want to achieve in the program:

  • read_coordinates(input_stream) -> range<input_iterator>
  • copy_filtered(coordinates, filter, render, output) -> void

The read_coordinates(input_stream) algorithm returns a range of values. For this case we use the Boost.Range concept of a Range which is a pair of iterators for the sake of brevity. Below is a simple implementation of the read_coordinates algorithm:

template <class InputStream>
iterator_range<typename InputStream::iterator>
read_coordinates(InputStream & input_stream) {
  return iterator_range<typename InputStream::iterator>(
    istream_iterator<coordinate>(input_stream),
    istream_iterator<coordinate>()
    );
}

template <class InputStream>
InputStream & operator>>(InputStream & is, coordinate & c) {
  is >> c.x >> c.y;
  return is;
}

This version of read_coordinates relies on the std::istream_iterator<> template that relies on an overloaded shift right (operator>>) implementation that works with an istream and an instance of a coordinate type. This is why we overload the operator>> too above.

The we look at the algorithm for copy_filtered(coordinates, filter, renderer, output). Below is this generic implementation that takes four arguments:

template <class CoordinateRange, class Filter, class Renderer, class OutputIterator>
void copy_filtered(CoordinateRange coordinates, Filter filter, Renderer render, OutputIterator output) {
  typename iterator_traits<CoordinateRange>::type iter = begin(coordinates),
    end_ = end(coordinates);
  while (iter != end_) {
    if (filter(*iter)) {
      *output++ = render(*iter++);
    }
  }
}

In the above algorithm we see that we treat coordinates as a range, and then iterate through each item to check if the filter algorithm returns true, we copy the rendered item to the output iterator. This is a very generic filter algorithm that we can use not only in the scope of the program we're building but also other programs. However, in our domain, we may encounter a few snags in terms of efficiency. Let me try and help by pointing out a few things:

  • In our application, we intend to filter based on the distance between a certain coordinate and the coordinates we read from the file. This means we have to compute the distance in the filter function and the render function.
  • The algorithm above assumes that the filter function will return only a bool. We can change it to get a pair of values instead, a boolean to indicate that it qualifies and the distance. The problem with that approach is that the algorithm then becomes very narrow and specific to our application.
  • The renderer's return type should be convertible to the output iterator's value type -- this kind of undue coupling is something that is a fact of generic programming (even the STL does this with the copy algorithm). However in our application, the renderer -- which presumably will turn a coordinate into a string given the format we intend to output, will also need to compute the distance.

This isn't turning out very well for our approach; although we've identified the algorithms we'll be requiring and the concepts we're going to work with, we aren't able to make an implementation yet that is as efficient as hand-coding everything into a single loop. The first incarnation made too much use of memory by making different lists -- that approach won't scale if we have a huge list of coordinates to go through and we keep making extra lists. This second approach eliminates the intermediate lists but computes the distances between the input coordinate and each coordinate from the file twice.

Third Time's the Charm

So let's back up a bit here. Up until this point we've been trying to make generic algorithms from our simple program. Our first incarnation would work although it consumed too much memory. Our second incarnation was a little alright although we're doing multiple computations of the distance. Let's then see how we can do if we can introduce some laziness in our implementation.

First, let's look at our mathematical requirements:

  1. We have a tuple (x,y) which represents a point in a two dimensional integral plane.
  2. We have a function distance(p1, p2) which represents the distance between two points.
  3. We want to filter the points whose distance from a source point is within a certain threshold value.
  4. We want to print these points and their distance from the source point.

From here we then transform this into a nested pseudocode representation:

  print( filter( _ < threshold, transform(distance(source, _), points) ) )

The idea above is that we want to "print" all "filtered" points, where the distance from a source is less than a threshold.

First problem we encountered (and dealt with) was that we had intermediate lists. We then imaging dealing instead with iterators in this solution. Second problem was that we computed distance twice -- first in the filtering, and second time in the rendering.

To solve both problems, we want to be able to transform each point as just a single tuple (x,y), into a bundling of the coordinate and the distance. So now we're looking at a tuple ((x,y), d) where d is the distance -- the reason we want to do this is so that we bundle the coordinate with the distance, so that we can filter them efficiently using our filter algorithm.

Let's see how we can implement our 'distance_bundler' function object:

template <class C>
struct distance_bundler {
  typedef tuple<C, double> result_type;
  distance_bundler(C const & source) : source_(source) {}
  tuple<C, double> operator() (C coord) {
    return make_tuple(coord, distance(source_, coord));
  }
  C source_;
};

So here we've made an effort already to make a specific data representation using our approach. However, consider that we are using a generic 'distance' function which can be easily retrieved via Argument Dependent Lookup (ADL) based on the type C's namespace.

Next we implement a unary predicate function that takes a coordinate-distance bundle, and returns true if the distance is within a threshold value. This is how we can implement that filtering predicate:

struct within_threshold {
  typedef bool result_type;
  within_threshold(double threshold) : threshold_(threshold) {}
  template <class C>
  bool operator() (tuple<C, double> distance_bundle) {
    return get<1>(distance_bundle) < threshold_;
  }
  double threshold_;
};

In both the preceding implementations, we can choose to use BCCL to make sure the type 'C' models the Coordinate concept. I purposely left that out here to yield more readable examples.

Our next step would be to transform a qualified coordinate and the distance to the source into a string which we can then output. Here we require a 'render' function which will turn a coordinate bundle into an STL string which we also implement:

template <class C>
string render(C const & c) {
  ostringstream output_stream;
  output_stream << "[" << coord_x(c) << "," << coord_y(c) << "]";
  return output_stream.str();
}

struct bundle_printer {
  typedef string result_type;
  template <class C>
  string operator() (tuple<C, double> bundle) {
    ostringstream output_stream;
    output_stream << render(get<0>(bundle)) << " " << get<1>(bundle);
    return output_stream.str();
  }
};

Again above we can use the BCCL to enforce that the type 'C' models Coordinate.

Next thing we do is we combine these elements into a lazily-applied construction that uses Boost.Iterators to stitch together the individual moving parts. The code follows, which is explained in greater detail later.

transform(
  make_filter_iterator(
    within_threshold(threshold),
    make_transform_iterator(
      begin(input_range),
      distance_bundler(source)
    )
  ),
  make_filter_iterator(
    within_threshold(threshold),
    make_transform_iterator(
      end(input_range),
      distance_bundler(source)
    )
  ),
  ostream_iterator<string>(
    cout, "\n"
  ),
  bundle_printer()
);
If you happen to be using Boost.RangeEx you can use the algorithm version that supports ranges:

transform(
  make_filtered_range(
    within_threshold(threshold),
    make_transformed_range(
      input_range,
      distance_bundler(source)
    )
  ),
  ostream_iterator<string>(cout, "\n"),
  bundle_printer()
);

Or:

transform( 
    input_range
    | transformed(distance_bundler(source)) 
    | filtered(within_threshold(threshold)
    , ostream_iterator<string>(cout, "\n")
    , bundle_printer()
);

So in the preceding three sample listings, instead of applying the algorithms to different containers we turn them into lazily-applied functions to each element in the input range. Here we feed directly from the file into memory (in the form of a coordinate object that has been read from the input stream) then apply the transformations and filtering as we go through the coordinates.

We also preserve the information required from the initially applied function 'distance_bundler' by bundling it with the result so that it can be used in the filtering -- and therefore have the computed value of the distance only computed once. This chaining approach and using generic algorithms to solve problems is a key theme being proposed by the book at varying levels.

Conclusions

Using the techniques espoused by the book "Elements of Programming" we can come up with a means of showing how we can use these techniques and apply them when building applications. Using generic algorithms and functions that do specific things at varying levels for both efficiency and correctness is a good practice that yields the desired result.

One of the biggest thrusts of the book is that of composition using building block functions in order to solve problems without having to resort to overly complex structural solutions. Instead of relying on object oriented programming alone, we can leverage functional programming and generic programming techniques to achieve efficient and expressive code. The role of composition in problem solving and extending software in general is something developers should be using more and more.

Instead of trying to re-inventing algorithms we should strive first to see if there are already algorithm implementations out there that we can leverage to allow us to solve problems faster by implementing the parts that are specific to our application. Short of using frameworks, libraries that are generic enough lend themselves to better re-use in more contexts than niche targeted solutions in the form of frameworks and platforms.

This concludes the three-part series on the coverage of the "Elements of Programming" book by Alexander Stepanov and Paul McJones. Post comments for questions, suggestions, clarifications, and further elaboration.

Friday, August 28, 2009

Memcache++ Updates (New Releases!)

The past week has been a big open source week for me which started from the weekend that spilled into the week. I've been updating and improving the implementation of the Memcache++ Client which I've released the first version of (0.9) last February 2008. One full year and some months afterwards I got an email from a former client who asked me if I can improve the client to handle newer memcache protocols that I first didn't need, and some older ones which he does need. Needless to say, I went at it and the product of which is a cleaner implementation that also supports more commands than it used to.


Before I go dive into the details of the implementation, let me first give you a history on the Memcache protocol, why it's important, and the project that led to the development (and release) of the memcache++ client library.

History

So back in February 2007, I started work with one of the largest social networking sites in the world. While there, I found myself in the middle of a project that was meant to increase the performance and scalability of an internal service that was crucial to the proper functioning of the site. One of the approaches taken was to leverage memcached to store frequently accessed and seldom changing data.

Without going through the whole project's history, we first used libmemcache which was the easiest thing to do at the time. It worked for a while, until we saw that there were some issues with the library because it made too many connections to the memcached server and this degraded performance over time. This may have been addressed with tuning the server on which the software we wrote (which was in C++) was running on, but then that meant operational overhead if we added machines or later encountered a spike that causes the number of connections to balloon to an insane number. While it worked, it was not the scalable solution that we were looking for.

For a while we tried to stick with it, but performance and stability tests kept complaining that over time the application (because it opened and closed lots of connections to the memcached servers) became unstable and performance numbers would suffer. This meant we either had to abandon the memcached idea altogether or we just write our own library to access the memcached servers from scratch. So we tried the writing our own library approach, and we did it from the ground-up with C++ forgoing the C implementation.

After a month or so of heavy development and integration work, we were finally able to meet the performance requirements of the internal service. Not only that, we were also able to make the library a separate component that just did what we needed -- access memcached, pool a set of connections to many servers, and re-use long-lived connections instead of opening and closing new sockets everytime we need to issue requests. The performance numbers were impressive and so we were happy that we got to where we needed.

Open Source

Then the idea came to mind: why don't we open source this library that we worked so hard to implement internally to get us to the point where we wanted to be (in a position to succeed) -- after all, memcached was instrumental to the success of other sites and was open source, why not the client library? Besides, it's one way for us to give back to the open source community on which a lot of the software we've written is running on.

So after a short discussion with the powers that be, the library was open sourced on January 2008 -- the project was registered on Sourceforge and the first release was uploaded on February 2008. This was met with some buzz, but apparently there aren't a lot of C++ projects that required the use of a memcache client out in the open. It seems most of the web projects that needed memcache were written in languages other than C++ -- hardly disheartening, because just a short few days later some hits came through and there were some users out there who were thinking about using the client in their C++ applications too.

Why Memcache?

So is memcached just for the web? I'd say hardly so. Let me give you a few examples where you'd want to be able to use memcache in your non-web (C++) applications:
  • Multiple Reads, Minimal Writes -- if you have an application that does a huge number of selects from a database from multiple sources (maybe different threads or different machines) and the data being queried rarely changes or is not time-critical data, then that means you can safely cache this either in your application or to an external cache provider. This external cache provider can be memcached and since you can store practically anything in memcache, you can definitely leverage this and alleviate your database from the strain of multiple readers and infrequent writers.
  • Shared Data -- Say you have an application which appends information to a queue or a file and you'd like to be able to share this across many instances of your application, you might think about putting this data straight to a database and then querying the database anytime you need that data. The problem here is that your bottleneck becomes the Database and it's pretty hard to fix that bottleneck if you rely on it for your shared data needs. Enter memcached which supports "append" and "prepend" operations -- so instead of just writing to the DB, you can also append or prepend the data to a key in memcached and just query that across instances of your application.
  • Atomic Counters -- Sometimes you might want to keep just a count of the number of times something is done (say the number of times a particular piece of data is accessed or modified). Typically you'll use a DB too so that you can query this information anytime you need it -- problem is that sometimes operations where you do COUNT() on primary keys from a DB would be resource-intensive especially if you have a large table. Enter memcached's 'incr' and 'decr' operations where you can set a key to be '0' and merely increment that key when a piece of data is accessed or modified. You can then query this information as often as you like alleviating the strain this introduces to your database.
There are all sorts of places in Enterprise or High Performance applications where you may opt to cache data and usually this is where memcached would be a good fit. Of course, if you just keep in mind that the data is just cache data and data you can afford to lose, then it fits perfectly into a solution where you can access the cache first (if it's not there, then go to the canonical source -- usually a database server).

The Client

Memcached is the server which maintains a hash map in memory -- mapping keys to values. Values can be anything, and keys can be strings that don't have control characters or spaces in it. It can be a UUID, an aggregated string, a random value, an MD5 hash, or just any string you want to associate data to. Although there's a lot of cool technology in the memcached server, the client we've released (and now I maintain) leverages this server-side technology by offering a fast and stable interface to setting and retrieving the data.

The Memcache++ client offers the following features:
  • Support for memcached's 'get', 'set', 'add', 'replace', 'append', and 'prepend' commands.
  • Supports redundancy pools where you can associate more than one memcached server to a pool identifier and when getting data from this pool you try all the members of the pool to retrieve the data; when setting data on this pool, all 'set', 'add', 'replace', 'append', and 'prepend' operations are mirrored across members of the pool for redundancy.
  • Uses persistent connections per memcache client handle: using one handle per thread allows you to maximize multi-core machines without having to synchronize operations on a single handle.
The latest release (came out last Wednesday, August 26, 2009) is version 0.11.1 which supports all the above features mentioned. In the following days I may release version 0.11.2 which adds a little more polish to the implementation.

These releases are just gearing up for release 1.0 which should support the "compare and set" (cas) operation, multiple get support, proper 'incr' and 'decr' support, among other performance and implementation enhancements.

Try it Out!

So if you have a C++ application that you think would greatly benefit from a separate cache provider like memcached, try out the library and let me know what you think! Hopefully it allows you to achieve performance and stability too like it allowed us to have back then -- and up until now.

You can get the source code through two means:
The author is also available for consulting in case you need to extend the library for your own needs and would rather not have the changes released as open source. The library is released under the Boost Software License which allows for the library to be used in commercial and proprietary applications. It is a header-only library and requires Boost 1.38 and higher to use. Consulting or support inquiries may be sent through this form.

Sunday, August 09, 2009

Elements of Programming: A Review (Part 2)

In Part 1 I covered three major foundational concepts being put forth by Alexander Stepanov and Paul McJones in Elements of Programming, their latest book published by Pearson Education. This post then attempts to apply the concepts in the book by putting forth some examples that were not lifted from the book, because it's already full of concrete samples and applications. In this installment in a three part series, I attempt to apply the generic programming foundational approach to building a simple range composition library.

After reading the book I got more and more interested in programming in the style that Generic Programming [1][2] prescribes and to apply the concepts being taught. The highlight of the elements of programming is the deliberate and evolutionary process in which algorithms are implemented to be both efficient and generic. However, to get to the generic part we will need to start with the specific implementation.

Concepts in Action

Even if we won't be having language level support for Concepts in C++0x, we can still enforce concept checking using the Boost Concept Check library. This library has been around for quite a while already and provides an easy means of defining concepts and enforcing them on your algorithms with the C++98 and C++0x. I've in-lined a simple example of how we can use Concepts in our programs:


#include <boost/concept_check.hpp>

template <class T>
struct Foo : boost::DefaultConstructible<t> {
private:
T t;
public:
BOOST_CONCEPT_USAGE(Foo)
{
T c(t); // T must be copy constructible
c.foo(); // T must have a nullary function foo
}
};

template <class T>
void call(T & t) {
BOOST_CONCEPT_ASSERT((Foo<t>));
t.foo();
}

struct a_foo {
void foo() { /* do nothing */ }
};

struct not_a_foo {
void foo_() { /* do nothing */ }
};

int main(int argc, char * argv[]) {
a_foo foo_thing;
not_a_foo bar;

call(foo_thing); // ok
call(bar); // not ok, will fail
}

In the above sample, what's interesting is that the code will not compile and that's by design. This example is also meant to show how much more readable error messages are with Concepts:


concepts_example.cpp: In function ‘void call(T&) [with T = not_a_foo]’:
concepts_example.cpp:34: instantiated from here
concepts_example.cpp:18: error: ‘struct not_a_foo’ has no member named ‘foo’
concepts_example.cpp: In destructor ‘Foo<t>::~Foo() [with T = not_a_foo]’:
/home/dean/boost/boost/concept/detail/general.hpp:38: instantiated from ‘static void boost::concepts::requirement<boost::concepts::failed************>::failed() [with Model = Foo<not_a_foo>]’
concepts_example.cpp:17: instantiated from ‘void call(T&) [with T = not_a_foo]’
concepts_example.cpp:34: instantiated from here
concepts_example.cpp:11: error: ‘struct not_a_foo’ has no member named ‘foo’

This is something that's worth paying for (and having) if you write a lot of code that depends on concepts -- and if you intend to write generic algorithms. From here on I'll make use of implementations that expressly use concepts implemented using the Boost Concept Check Library (BCCL).

Ranges

One of the main theses of the book is that generic algorithms build upon basic Concepts and implement generic enough abstractions to make implementations re-usable and powerful in their own rights. The book talks a lot about the Iterator concept hierarchy in the Standard Template Library (STL) and how you can write algorithms that build upon these Iterator concepts and some "refinements" (or more narrow specific types of Iterator concepts).

Briefly the book also covers a topic on Ranges that are represented as a pair of Iterators. Looking at this concept of a Range and the thoughts shared lately by Andrei Alexandrescu in his keynote address [3] in the recently concluded BoostCon 2009, I'm inclined to think that there is merit in what Andrei has been proposing -- that there may be something to be gained by making Ranges the primitive Concept to deal with when implementing generic algorithms that deal with a collection of data. Having said this, let me give you a short discourse into how you can implement this using a similar approach shown in the book.

First let us define a simple concept we want to deal with: a Range. A Range allows us to do a finite number of things to it:

  • Pop the element at the front of the range (pop_front(range)).
  • Get the element at the front of the range (front(range)).
  • Check if a range is empty (empty(range)).


To put this in code, we need a Range concept that we should be able to check programmatically. The following listing shows a candidate definition of the minimal Range concept using BCCL:


template <class R>
struct Range : boost::DefaultConstructible<R>, boost::EqualityComparable<R> {
private:
R r;
typedef typename R::value_type value_type;
public:
BOOST_CONCEPT_USAGE(Range)
{
R c(r); // R must be copy constructible
pop_front(c); // support a pop_front action
value_type ref = front(c); // support a front action
bool result = empty(c); // check if a range is empty
}
};


In the code above we define the syntactic requirements for a type R to model the Range Concept. We then implement a very simple range, a counted_range which defines the numeric range from [0,max).


struct counted_range {
counted_range(size_t max = 0) : count_(0), max_(max) {}
counted_range(counted_range const & other) : count_(other.count_), max_(other.max_) {}
bool operator==(counted_range const & rhs) {
return ((rhs.count_ == count_) && (rhs.max_ == max_));
}
bool operator!=(counted_range const & rhs) {
return !(*this == rhs);
}
typedef size_t value_type;
private:
friend void pop_front(counted_range &);
friend size_t front(counted_range &);
friend bool empty(counted_range &);
size_t count_, max_;
};

void pop_front(counted_range & range) {
++range.count_;
}

size_t front(counted_range & range) {
return range.count_;
}

bool empty(counted_range & range) {
return range.count_ >= range.max_;
}


Note in the implementation that we didn't define member functions but rather relied upon free functions -- this is by design because our Range Concept definition requires free functions pop_front, empty, and front. Now that we have a simple definition of a counted_range all we need to do is write an algorithm that actually requires the Range concept. Let's see a simple algorithm named count which will just count the number of elements in a Range.


template <class R>
BOOST_CONCEPT_REQUIRES(
((Range<R>)),
(size_t)
)
count(R range) {
size_t result = 0;
while (!empty(range)) {
pop_front(range);
++result;
}
return result;
}


We can see here that we have a function count that takes as a parameter an object range of type R which models the Range Concept and returns a value of type size_t. Note that we're dealing with a copy of the range, which allows us to mutate the range within the algorithm without having to deal with the original range.

We can then call the count algorithm a regular function -- one that will return the same value given the same input. Because it is a regular function, we can then be sure that it will return the same value given a range of the same properties, and then use it in our higher level functions and be sure that we can count on count (pardon the pun) to do the right thing every time it is called. Our only worry is that the codomain of the function is the set of values that can be represented with the size_t type. That means, if you have a Range type that represents a range that has more than the maximum value that a size_t can handle then either you have an invalid Range type or count is ill-defined for that input.

We then have two choices here: 1) we require the Range to support a distance_type nested class which defines the type for the distance between the first and last element of a Range or 2) we refine the Range concept with another concept called a FiniteRange which defines a finite range (meaning the number of elements in the range can be counted) and has the distance_type part of each type that models the FiniteRange Concept.

I like choice 2 better because we can then have two more specific kinds of Ranges -- an InfiniteRange and a FiniteRange. We can then have classifications of FiniteRange concepts that refine further the FiniteRange type and possibly see ForwardRange, BidirectionalRange, RandomAccessRange, InputRange, etc.; I can then also imagine another set of InfiniteRange concepts that refine InifiniteRange and see a GeneratorRange, OutputRange, IdentityRange and others like so.

Further Thoughts

What I attempted to show here without quoting entries from the book (which I definitely encourage everyone to get at least a copy of) is that using the techniques of being able to understand the relationship between concrete classes and the notion of Concepts allows us to approach problem solving in a more formal and generic manner. Being able to put requirements on the types of the parameters of functions and understanding the limitations and capabilities of our algorithms gives us better foundations to write solutions on top of.

This means we should be able to reason about our solutions and enforce the reasoning with the help of libraries like the BCCL. Because we are then able to reason about our solution we can build better software with more confidence that we actually know what is going on. Combine this with the generic programming approach of working from specifics to generics and you then have a method of building libraries and solutions that will withstand the test of time.

Adopting a mindset of understanding the requirements on inputs of a function and also the implications of the output of a function gives us an understanding of the limitations and applicability of our solutions on a broad range of problems. Maybe we have an algorithm that will be efficient for a certain class of Concepts and better algorithms that will be more efficient with more refined classes of Concepts. Using both the Object Oriented nature along with the Functional and Generic nature of C++, we can then have a better chance of writing algorithms, programs, solutions, and applications that we can be confident about using equational reasoning.

Part 3

In Part 3 I will attempt to construct a logical thesis that will show how to approach application development using the concepts put forth in the book. In the next (and last) installment of my review of the book I will put together an example that uses the BCCL, Boost.Test, and Boost.Asio.

References

[1] Generic Programming : Wikipedia Definition

[2] generic-programming.org

[3] BoostCon 2009: Iterators Must Go!

This is part 2 of a three-part series on the review of the Elements of Programming book by Alexander Stepanov and Paul McJones. Comments, suggestions, and questions are welcome in the comments section. In case you prefer email, send them to the author through me@deanberris.com or dean.berris@cplusplus-soup.com.

Tuesday, August 04, 2009

Elements of Programming: A Review (Part 1)

Over the course of the past few months much has happened to me in my life. Besides getting married to the love of my life (twice) and moving to a new job, I've received a review copy of the recently released Elements of Programming by Alexander Stepanov and Paul McJones. What follows is my take on what I think of the book and why everyone -- and I say absolutely everyone -- who has been doing software engineering for quite a while already should get a copy of this book.


I personally think this book is as important to the field of software engineering (and programming in general) as how the Principia Mathematica was to mathematics. This is the book which will give those who care about the craft of software engineering and problem solving with software the fundamental insight into how libraries and software packages meant to deal with real-life projects should be written. Consequently after reading this book you'd want to look at all your source code you ever wrote before and see how you can apply the concepts being taught in the book.

This is how valuable I think this book is: I'll actually get multiple copies of the book as foundational reading for my team -- and pay for the copies myself if I have to. This is one of those books that I've come across where I think the price I pay to have a copy of it is definitely worth every penny if not even more.

Enough of what I think; now I want to talk about what the book contains and why I'm raving about the book so much it's taken three paragraphs of this entry for me to get to this point. But first, let me give some "prerequisites":

  • If you haven't been programming in C++ before, get into it now if only to be able to understand what the principles and techniques being taught are. It's not a prerequisite that you are an expert in C++, but if you understand how to use templates in your own code that would be great.
  • If you didn't have a major in mathematics, you might want to keep an elementary algebra and discrete mathematics reference around. Having been a computer science major at one point in my life, I can now appreciate the application of Set Theory and Group Theory to practical software development. It would really help if you have solid mathematical background.
  • If you think this will be an easy read, adjust your expectations now. It's going to be really hard to just skim through without thinking and focusing. Remember how you read (or tried to read) your high school textbooks to prepare for an exam? Well I do, and I'd recommend you use the same focus and techniques to read and digest what this book is all about.
  • Keep a notebook close so that you can take notes about how your brain is changing the way you look at software engineering. I'm also recommending that you keep a notebook handy also so that you can paraphrase what you're thinking while going through the book and just keep track of the thoughts.
One general word of caution: if you think the concepts are going way over your head, go back to the basics and the fundamentals. This is one general rule that you'd like to go back to when reading this book. If you don't understand one part, do your research and go back to the material -- it generally helps if you have time to actually digest it and focus as you read it.

First, the general style of the book is that of a textbook. There are no anecdotes and personal notes from the authors -- it feels like you're reading a collection of papers and building up a very grand case for the techniques and concepts to be shared. Alexander and Paul do a great job of keeping the material terse and yet be expressive enough to keep a reader engaged.

Second, the subject mater and tone is serious. This book does a great job of being timeless in its treatment of the subject. This is as clear cut as you'd want to get when you're dealing with the fundamentals of programming and software engineering. The authors keep it very serious and very "matter-of-fact" throughout the book and it drives the point that this book is definitely something that everyone should be able to understand and benefit from. They keep to facts and the subject very effectively without being clever about anything.

Third, the C++ they use is just the subset of the whole language; if C++0x would contain Concepts (which it won't as of the most recent developments) then the code in the book are the definitive examples of how to use the proposed feature set. Having said this, the authors provide an appendix that deals with the syntax used and the features you can leverage in C++ to make the code compile without Concepts support from the compiler.

Having done the preliminaries, I'll go into some of the details in the book. In part 2 I will attempt to apply the concepts in the book, while in part 3 I will attempt to construct a logical thesis on how to approach application development using the techniques taught in the book. In this part I will highlight some of the important foundational concepts put forth by the book in three sections: Concepts, Composition, and Generalization.

Concepts

Although C++0x won't contain Concepts, the book starts out with laying the foundations of types and values then graduates to Concepts. After laying the groundwork down for the foundations, the concept of Concepts wherein you describe a meta-class (a class of Types) that abide by a set of rigid qualifications are then described. Essentially, they introduce the concept of a Concept by looking at data, classes (types), and then by creating a generalization of types that model a Concept.

This clear treatment and discussion of Concepts is one of the most if not the only definitive text I've come across that deals with the subject matter. Short of the book on Generic Programming, the concise and matter-of-fact approach to describing and defining Concepts is the best I've read bar none.

What the whole book then depends upon is the notion that a higher level (or more generic) treatment of classes instead of dealing with a hierarchical view of types (as in Object Oriented Programming) by defining the requirements on a type -- that it supports certain operations applied to the type, that there are certain axioms that hold against a type, and that there are consistent complexity requirements on the algorithms that deal with instances of the type -- that we can write generic, efficient, and effective algorithms relying on types modelling Concepts. The Type Theory aspects relevant to generic programming become evident if we draw parallels between Haskell Type Classes and C++ Concepts; we can then begin to see how transformational and important Concepts can be for sufficiently complex software engineering projects.

Composition

One of the big things that the book promotes is the approach to composition: that higher level or higher order functions can be (and most probably should be) composed of smaller specific functions. The type of composition is not very far from the functional programming concept of composition, but rather the notion that you solve bigger problems by solving other problems with relatively smaller solutions.

The kind of composition being promoted by the book is the generic kind where once you define and operate with Concepts, that the algorithms that you compose are then generic enough to be applied to any instance of a type that models the Concepts the algorithm was meant to work with. This is very powerful because this means that if you have an algorithm that deals with a broad range of types instead of just types that derive from a certain base, this increases the re-use value for the algorithms you implement (or the library you've written).

Much like how bottom-up programming is meant to make complex tasks easier to solve by solving smaller tasks first, the composition approach applies very well to complex application development projects. The book gives you a good and clear idea of how to go about doing this in your own project albeit in a subtle manner that lets you formulate the solutions by showing the techniques and allowing you to run away with it.

Generalization

What the book doesn't do is come straight at you with generalizations, but rather it deals with foundations and specifics first before generalizing. The reasoning behind this is like with Set or Group Theory, you can only extend to a greater set by defining rules and building upon predicates that apply to an individual. What the book teaches is that you first go specific then get generic progressively.

The book also doesn't first show you what the generic "formula" or "technique" is, but rather it deals with a specific area first and the solutions that pertain to that area and then extends it to a more generic setting. By building on previously developed concepts and techniques readers get a very good example of how to correctly generalize -- by using mathematical logic and foundations and different techniques of proof at an abstract level to reason about the solution.

Just like how the math and computer science majors are taught in discrete mathematics, your generalization should always apply true for the specific cases covered by the generalization. Unless you can prove that your generalization is logically sound, then you don't have an assurance that your generalization is correct -- therefore it would make your generalization invalid. Knowing this and applying it to software engineering is nothing short of brilliant.

Parting Shots

So if you've read through this (lengthy) post up to here, I'll say it once more: get this book if you're looking to improve the way you do software engineering and programming. If you care for the art/science of programming and problem solving with programming languages, pick this book up and I'm positive you will come to the same conclusions as I've reached. It took me a while to write this because mainly I wanted to do the book justice by doing an in-depth look and understanding how important this book is to the field.

More to come in the next parts where I attempt to apply the concepts and techniques promoted by the book in real-world application development projects.

The author would like to thank Pearson Education for sending a copy of this very important piece of literature for review. For questions and suggestions, please post in the comments or send email to me@deanberris.com or dean.berris@cplusplus-soup.com.

Update: Minor typographical edits from "discreet" to "discrete" mathematics.

Monday, June 08, 2009

Bloom Filters

During the weekend I was fascinated by a simple utility called a Bloom Filter. From the Wikipedia entry:

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
So I tried implementing my own generic Bloom Filter in C++ and submitted it to the Boost C++ Libraries. My email went a little something like this:

During the weekend, I got acquainted with and excited about Bloom
Filters and the utility of such a data structure. I then proceeded to
look at generic implementations of a bloom filter and I thought about
implementing one myself instead as a simple exercise.

Attached is what I came up with which I'm submitting as a library for
review (and uploading to the vault). Also attached is a sample program
which uses the bloom_filter.

This then led to a few responses, and now I'm proud to say that my implementation can now be checked out and tracked from Boost Sandbox! The URL to the subversion repository is:
https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/
Have fun with the library, and I would definitely like to know what you think about it!