In the interest of exploration, I played around one Sunday afternoon with the C++0x features available from GCC 4.3.2 -- to better see what I can use at work when I migrate the application I'm writing from the C++98 standard to the C++0x standard. To gauge the differences, I tried implementing a simple stream-fusion approach to simultaneously compute the the mean, minimum, and maximum in a collection of integers.
For a little background, let me try and describe what stream fusion is briefly. Simply:
- Stream Fusion
- Technique of removing multiple loops over a container and performing the required computations in a single pass. This technique is meant to minimize the number of loop iterations hopefully to improve the performance of an application.
Stream fusion is a good way to improve performance in applications that deal with huge containers and that perform multiple passes on the container to do multiple things. Sometimes the technique is a little more tideous to apply than the naive solutions but when performance matters it's something you can try.
Let's look at a naive version of getting the mean, minimum, and maximum of
an array of integers:
int numbers[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int min = min_element(numbers, numbers + 10);
int max = max_element(numbers, numbers + 10);
double mean =
static_cast<double>(accumulate(numbers, numbers + 10))
/ 10.0
;
Here, we have three loops that compute the mean, minimum, and maximum from
the same container. From the way it looks, it's just fine that we have three loops since we just have 10 elements in the array.
But what if for example we were reading the integers from a file and we have millions of integers to deal with, then going over the file three times doesn't seem like a viable solution especially if performance matters. At that scale of the problem, you can count on performance being a big deal. It also doesn't make sense for us to keep everything in memory either because the data may not fit in memory or the data actually comes from an input
iterator bound to a stream.
Stream fusion then becomes more obvious because the reason it's called stream fusion is because usually the computations are associated with going through data that comes from a stream. Going "back" in a stream usually doesn't make sense or performance-wise is very costly, so we try and keep
the computation going as we consume items in the stream to save on space consumption and to get the performance enhancements.
So first thing we try is to do stream fusion with the C++98 standard. We'll use the TR1 tuple implementation that comes with GCC and the standard algorithms as well as a special computation type we'll callmean_min_max. First, we'll think about how to simultaneously compute the mean, minimum, and maximum.
To do this, we'll have to be able to keep track of the following variables:
- minimum so far : we'll call this MIN
- maximum so far : we'll call this MAX
- total so far : we'll call this ACC
- count of elements dealt with so far : we'll call this COUNT
What we want to do is for each iteration of the loop, we'll look at one element and see if it's the minimum so far, maximum so far, add it to ACC, and increment COUNT. To do this in functional programming means we'll have to rely on a left-fold and make mean_min_max a binary function object taking a state variable and an element as the respective arguments to the function operator overload.
To make it as generic as possible, we implement mean_min_max as a polymorphic function object type. We come up with the C++98 compliant version of mean_min_max below:
struct mean_min_max {
template <class T, class U>
T operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return make_tuple(
// This is the running mean...
static_cast<double>(get<ACC>(state) + element)
/ static_cast<double>(get<COUNT>(state) + 1),
// This is the minimum so far
(get<COUNT>(state) == 0)?
element
: (get<MIN>(state) > element)?
element
: get<MIN>(state),
// This is the maximum so far
(get<COUNT>(state) == 0)?
element
: (get<MAX>(state) < element)?
element
: get<MAX>(state),
// This is the new value of ACC
get<ACC>(state)+element,
// This is the new value of the COUNT
get<COUNT>(state) + 1
);
}
};
We can then use this as the fourth argument to the call to accumulate while we use the tuple (0.0, 0, 0, 0uL, 0) as the initial state of the left fold:
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
make_tuple(mean, min, max, 0uL, 0),
mean_min_max()
);
If we notice the call to accumulate we'll see that there's nothing special about this construct. Sure, we've accomplished what we want to accomplish with just the C++98 standard. You might be wondering what the problem with the above implementation could be, and in the next section I'll detail what those issues are.
Now in the above solution, we have quite a number of inefficiencies that aren't readily obvious. First of these are the inordinate number of copies we'll be doing when returning the tuple from mean_min_max. In C++98 we're stuck with copies when returning data from functions. Unless we use something like an auto_ptr<...> as a return and use the heap, we don't have move semantics.
But in C++0x, we now have support for move semantics as an addition to the language. We rely on two features: r-value references and the move library function. So first, let's talk a little about what r-value references are and why they're important.
- R-Value References
- R-value references are denoted by && right after a type. This is used to denote that only temporary values and literal constants can be referred to by an r-value reference. These r-value references actually take on the storage of the temporary value thereby not needing to allocate new storage for a new variable. It is then said that the temporary is 'moved' from the original scope to the new scope without relying on copies and additional temporaries.
With a C++0x implementation of the mean_min_max we can change a few things that would allow the solution to minimize on copies:
struct mean_min_max {
template <class T, class U>
T && operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
};
Here we change the result of the function call operator to be an r-value reference of type T and that we should move the temporary created in the body to be returned into the storage of wherever the result is to be put. Instead of the compiler calling the copy constructor of the tuple, it then relies on the move constructor (or a constructor that takes a T && as
the sole argument instead of a T const &) to initialize the values of the appropriate members.
If you have enough iterations and are dealing with heavy objects, r-value reference returns and move semantics will save you a lot of unnecessary copies. The less copies you make, the better the performance you get.
Another chance where we can use this technique is at the call to accumulate where we move the initial temporary tuple instead of letting the compiler copy the tuple for us:
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(
make_tuple(mean, min, max, 0uL, 0)
),
mean_min_max()
);
Unfortunately, GCC 4.3.2 doesn't yet support C++0x Lambda's . I haven't checked yet if GCC 4.4.x already supports it, but since lambda's aren't supported by the compiler I was testing, I'm not yet able to show a solution to the problem without testing it. However, if your compiler already supports the C++0x lambda's, you can go ahead and move body of the mean_min_max function into a lambda like this:
double mean = 0.0;
int min = 0, max = 0;
typedef tuple<double, int, int, unsigned long, size_t> accumulator_tuple;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(make_tuple(mean, min, max, 0uL, 0)),
accumulator_tuple &&
[](accumulator_tuple & const state, int const & element) {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
);
If you're just making throw-away functions, lambda's are an attractive approach to keep the context of the computation closer to the point where you need the computation to be done. This is great if you are adept at using the STL algorithms and are comfortable with functional programming principles.
In this example the Lambda's noticeably don't have parametric types in the construction. This is because C++0x are monomorphic -- which means they only have one form. This makes C++0x lambda's a little less flexible compared to hand-coded function objects that have template member overloads of the function call operator (like our version).
In this article I've highlighted important new features of C++0x which should allow developers to be more productive and be able to produce better performing code with R-value references, move semantics, and lambda's. There are a handful more new features upcoming at the language level which I would like to explore as well in upcoming articles.
All in all, C++0x allows you to write powerful and performant code that when combined with powerful idioms (like stream fusion) yields superior solutions. Even though the C++98 standard allows you to write good code, C++0x just allows you to write better code in more ways than one.