Sunday, March 15, 2009

GCC C++0x Features Exploration

In the interest of exploration, I played around one Sunday afternoon with the C++0x features available from GCC 4.3.2 [1] -- to better see what I can use at work when I migrate the application I'm writing from the C++98 standard to the C++0x standard. To gauge the differences, I tried implementing a simple stream-fusion approach to simultaneously compute the the mean, minimum, and maximum in a collection of integers.

For a little background, let me try and describe what stream fusion is briefly. Simply:




Stream Fusion

Technique of removing multiple loops over a container and performing the required computations in a single pass. This technique is meant to minimize the number of loop iterations hopefully to improve the performance of an application. [2]

Stream fusion is a good way to improve performance in applications that deal with huge containers and that perform multiple passes on the container to do multiple things. Sometimes the technique is a little more tideous to apply than the naive solutions but when performance matters it's something you can try.

Let's look at a naive version of getting the mean, minimum, and maximum of
an array of integers:


 
int numbers[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int min = min_element(numbers, numbers + 10);
int max = max_element(numbers, numbers + 10);
double mean =
static_cast<double>(accumulate(numbers, numbers + 10))
/ 10.0
;

Here, we have three loops that compute the mean, minimum, and maximum from
the same container. From the way it looks, it's just fine that we have three loops since we just have 10 elements in the array.

But what if for example we were reading the integers from a file and we have millions of integers to deal with, then going over the file three times doesn't seem like a viable solution especially if performance matters. At that scale of the problem, you can count on performance being a big deal. It also doesn't make sense for us to keep everything in memory either because the data may not fit in memory or the data actually comes from an input
iterator bound to a stream.

Stream fusion then becomes more obvious because the reason it's called stream fusion is because usually the computations are associated with going through data that comes from a stream. Going "back" in a stream usually doesn't make sense or performance-wise is very costly, so we try and keep
the computation going as we consume items in the stream to save on space consumption and to get the performance enhancements.


Stream Fusion, C++98 Style


So first thing we try is to do stream fusion with the C++98 standard. We'll use the TR1 tuple implementation that comes with GCC and the standard algorithms as well as a special computation type we'll callmean_min_max. First, we'll think about how to simultaneously compute the mean, minimum, and maximum.

To do this, we'll have to be able to keep track of the following variables:




  • minimum so far : we'll call this MIN

  • maximum so far : we'll call this MAX

  • total so far : we'll call this ACC

  • count of elements dealt with so far : we'll call this COUNT



What we want to do is for each iteration of the loop, we'll look at one element and see if it's the minimum so far, maximum so far, add it to ACC, and increment COUNT. To do this in functional programming means we'll have to rely on a left-fold and make mean_min_max a binary function object taking a state variable and an element as the respective arguments to the function operator overload.

To make it as generic as possible, we implement mean_min_max as a polymorphic function object type. We come up with the C++98 compliant version of mean_min_max below:


 
struct mean_min_max {
template <class T, class U>
T operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return make_tuple(
// This is the running mean...
static_cast<double>(get<ACC>(state) + element)
/ static_cast<double>(get<COUNT>(state) + 1),

// This is the minimum so far
(get<COUNT>(state) == 0)?
element
: (get<MIN>(state) > element)?
element
: get<MIN>(state),

// This is the maximum so far
(get<COUNT>(state) == 0)?
element
: (get<MAX>(state) < element)?
element
: get<MAX>(state),

// This is the new value of ACC
get<ACC>(state)+element,

// This is the new value of the COUNT
get<COUNT>(state) + 1
);
}
};

We can then use this as the fourth argument to the call to accumulate while we use the tuple (0.0, 0, 0, 0uL, 0) as the initial state of the left fold:


 
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
make_tuple(mean, min, max, 0uL, 0),
mean_min_max()
);

If we notice the call to accumulate we'll see that there's nothing special about this construct. Sure, we've accomplished what we want to accomplish with just the C++98 standard. You might be wondering what the problem with the above implementation could be, and in the next section I'll detail what those issues are.


Stream Fusion, C++0x Style


Now in the above solution, we have quite a number of inefficiencies that aren't readily obvious. First of these are the inordinate number of copies we'll be doing when returning the tuple from mean_min_max. In C++98 we're stuck with copies when returning data from functions. Unless we use something like an auto_ptr<...> as a return and use the heap, we don't have move semantics.

But in C++0x, we now have support for move semantics as an addition to the language. We rely on two features: r-value references and the move library function. So first, let's talk a little about what r-value references are and why they're important.




R-Value References

R-value references are denoted by && right after a type. This is used to denote that only temporary values and literal constants can be referred to by an r-value reference. These r-value references actually take on the storage of the temporary value thereby not needing to allocate new storage for a new variable. It is then said that the temporary is 'moved' from the original scope to the new scope without relying on copies and additional temporaries. [3]



With a C++0x implementation of the mean_min_max we can change a few things that would allow the solution to minimize on copies:


 
struct mean_min_max {
template <class T, class U>
T && operator() (T const & state, U const & element) const {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
};

Here we change the result of the function call operator to be an r-value reference of type T and that we should move the temporary created in the body to be returned into the storage of wherever the result is to be put. Instead of the compiler calling the copy constructor of the tuple, it then relies on the move constructor (or a constructor that takes a T && as
the sole argument instead of a T const &) to initialize the values of the appropriate members.

If you have enough iterations and are dealing with heavy objects, r-value reference returns and move semantics will save you a lot of unnecessary copies. The less copies you make, the better the performance you get.

Another chance where we can use this technique is at the call to accumulate where we move the initial temporary tuple instead of letting the compiler copy the tuple for us:


 
double mean = 0.0;
int min = 0, max = 0;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(
make_tuple(mean, min, max, 0uL, 0)
),
mean_min_max()
);

GCC and C++0x Lambda's


Unfortunately, GCC 4.3.2 doesn't yet support C++0x Lambda's [4]. I haven't checked yet if GCC 4.4.x [5] already supports it, but since lambda's aren't supported by the compiler I was testing, I'm not yet able to show a solution to the problem without testing it. However, if your compiler already supports the C++0x lambda's, you can go ahead and move body of the mean_min_max function into a lambda like this:


 
double mean = 0.0;
int min = 0, max = 0;
typedef tuple<double, int, int, unsigned long, size_t> accumulator_tuple;
tie(mean, min, max, ignore, ignore) =
accumulate(numbers, numbers + 10,
move(make_tuple(mean, min, max, 0uL, 0)),
accumulator_tuple &&
[](accumulator_tuple & const state, int const & element) {
enum { MEAN, MIN, MAX, ACC, COUNT };
return move(
make_tuple(
// same as before
)
);
}
);

If you're just making throw-away functions, lambda's are an attractive approach to keep the context of the computation closer to the point where you need the computation to be done. This is great if you are adept at using the STL algorithms and are comfortable with functional programming principles.

In this example the Lambda's noticeably don't have parametric types in the construction. This is because C++0x are monomorphic -- which means they only have one form. This makes C++0x lambda's a little less flexible compared to hand-coded function objects that have template member overloads of the function call operator (like our version).


More to Come


In this article I've highlighted important new features of C++0x which should allow developers to be more productive and be able to produce better performing code with R-value references, move semantics, and lambda's. There are a handful more new features upcoming at the language level which I would like to explore as well in upcoming articles.

All in all, C++0x allows you to write powerful and performant code that when combined with powerful idioms (like stream fusion) yields superior solutions. Even though the C++98 standard allows you to write good code, C++0x just allows you to write better code in more ways than one.







[1]GCC 4.3.2 http://gcc.gnu.org/gcc-4.3/






[2]Stream Fusion http://www.cse.unsw.edu.au/~dons/papers/CLS07.html






[3]R-value References http://www.open-std.org/JTC1/sc22/wg21/docs/papers/2004/n1690.html






[4]Lambda's http://en.wikipedia.org/wiki/C%2B%2B0x






[5]GCC 4.4.x http://gcc.gnu.org/gcc-4.4/changes.html

13 comments:

IDisposable said...

I would really like to see example-code that is best-practice.

In this case, why on Earth are you not using any temporaries in the body of the make_tuple method for the return values of all those get<XXX> calls?

Since the whole concept of this stream-merge is for the sake of efficiency, why not be explicit for the compiler and avoid multiple method calls?

weima said...

hi,

i wish i was as unenthusiastic about C++ as you are... :)
thanks for the post.

regards,

CPP09 said...

For Lambda:
Check the following,
ICC v 11 or above;
VS 2010
GCC: lambda-branch;

Dean Michael said...

Hi IDisposable,

If you knew what get<XXX> did, you wouldn't need temporaries. What get<XXX> returns are references, not copies. So if you used temporaries, those would be needless copies that wouldn't make sense in a performance context.

make_tuple will deduce the types of the arguments, and create a tuple that would be initialized from the references passed in as the result of the get<XXX> function calls. In reality, what will happen is the get<XXX> function calls reduce to accessing references.

Thanks for dropping a line. :)

Dean Michael said...

Thanks CPP09,

If you notice, I was using GCC 4.3 -- and the whole post was about using GCC. In the future, I'll try the latest and greatest compilers and test out more C++0x features. :-)

Thanks for the heads up! :-D

Allister said...

Hi Dean,
I still don't quite get these r-value references but they do look very useful. But thanks for the post. You the man! :)

SG said...

Hi!

Your interpretation of the Rvalue Reference proposal is wrong. Rvalue references don't "take on the storage of the temporary". They are normal references. You proposed returning an rvalue reference to a function-local object. This reference will be a dangling reference.

YOU> "This is used to denote that only temporary values and literal constants can be referred to by an r-value reference"

That's wrong. The original proposal allowed rvalue references to bind to lvalues as well. But it turned out to be dangerous which is why the proposal has been updated (see document N2844 from the march 2009 mailings).

Cheers!
SG

Dean Michael said...

Hi SG,

The reason you can actually return a function-local temporary is because you ought to use the 'move' function call -- which moves the temporary into the storage of whatever was going to contain it or copy from it.

The idea is:

int && i = 2;

The literal value '2' has its own storage, and you get an rvalue reference to that storage. In the following:

movable_type temp = move(movable_type());

You actually move the temporary into the storage of temp, and thus you get space saving and save your application from an extra copy. Think:

int i = move(int(2));

And thanks for the pointer, rvalue references should not be able to bind to lvalue references anymore.

Thanks for dropping a line!

pizer said...

You declared a function that returns an rvalue *reference* to a *function local* object.

T && operator() (T const&...

You need to replace the rvalue reference "T&&" in the return type with a normal "T". Otherwise you get undefined behaviour (dangling reference!). Rule of thumb: Never ever return rvalue references. The only two meaningful functions that do so are std::forward and std::move.

Cheers!
P

Mathias Gaunard said...

Your code is completely incorrect. You are returning a rvalue reference to a temporary. Replace T&& by T.

Also, remove all those explicit move calls. It seems you don't get the point of rvalue references at all. They were introduced to differentiate lvalues from rvalues. The move function only casts an lvalue to an rvalue.
Calling move on an rvalue is useless, since rvalues are already recognized as such.

kiapura said...

I'm sorry Michael, you're interpretation of rvalue references is incorrect for even more reasons than these guys pointed out.

There is no "move constructor" on a tuple, a simple logical deduction would lead you to realise that there is no more efficient way of "moving" a tuple than copying it. Moving is mainly useful when a pointer to allocated memory can be copied between the objects (as you know the pointer member of the RHS is no longer necessary).

kiapura said...

I should correct what I said and point out that a move constructor on a tuple would be beneficial in the case where one of the tuple elements is movable, but in your example this isn't the case.

Candy said...

why not

double mean = 0.0;
int meansum = 0;
int meancount = 0;
int min = 0, max = 0;

std::accumulate(numbers, numbers + 10, 0,
[&](int, int const &value){
if (min > value) min = value;
if (max < value) max = value;
meansum += value;
meancount++;
});
if (meancount)
mean = (double)meansum / meancount;

or, if you can have static variables in a lambda (not sure about that):

double mean = 0.0;
int min = 0, max = 0;

std::accumulate(numbers, numbers + 10, 0,
[&](int, int const &value){
static int meansum = 0;
static int meancount = 0;
if (min > value) min = value;
if (max < value) max = value;
meansum += value;
meancount++;
mean = (double)meansum / meancount;
});