Back to C++ Optimization Techniques
4: C++ Final Optimizations
Your application is up and running. The data structures are ideal, the algorithms sublime, the code elegant, but the app - well, it's not quite living up to its potential. Time to get drastic, and with drastic measures, there are tradeoffs to consider. These optimizations are going to make your code less modular, harder to understand, and more difficult to maintain. They may cause unexpected side effects like code bloat. Your compiler may not even be able to handle some of the more advanced template-based techniques. Proceed with caution. Arm yourself with a good profiler.
4.1: Inline Functions
Theinline keyword is an extremely simple and powerful means of optimizing C++ programs. In one oft-cited case, the C++ standard library sort ran 7 times faster than qsort on a test of 500,000 elements because the C++ version was inlined. On the other hand, inline functions are also overused, and the consequences are significant. Inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called, but copied into place. The performance gain comes in avoiding the function call, stack frame manipulation, and the function return. The gains can be considerable.
Beware! Inline functions are not free. They can increase program size. They can increase execution time by reducing the caller's locality of reference. When sizes increase, the caller's inner loop may no longer fit in the processor cache, causing unnecessary cache misses and the consequent performance hit. Inline functions also increase build times - if inline functions change, the world must be recompiled. Some guidelines:
Sample code and relative performance:
The biggest gain is inlining the function that returns the complex value. Inlining that function increased performance over two times the non-inlined version. Inlining the larger functions or the functions that returned non-trivial objects (strings, bitmaps) did not improve performance at all.
Note that using Microsoft's inline "any suitable" function option did not inline any other functions other than those already marked inline, even the world's most simple function a()! Clearly, there's room for improvement in the Visual C++ compiler.
4.2: Avoid Temporary Objects: the Return Value Optimization
C++ is a powerful language, but the power behind programmer-defined objects and guaranteed object initialization and destruction is not without a price. In many cases, the price you pay is what's called "hidden temporary objects." C++ merrily builds and destroys objects in many non-obvious places. Some of the performance techniques already listed above are methods of avoiding temporaries. For instance, passing objects by reference instead of by value avoids temporaries. Using ++x instead of x++ avoids temporaries. Using theexplicit keyword helps avoid hidden temporaries.
Here's one more tip for avoiding temporaries. It's called return value optimization. The best way of showing how it works is through an example. Mult multiplies two complex values and returns the result.
The code is correct, but it could be more efficient. We already know we can improve the efficiency by initializing the complex value when it's constructed:
Now let's take it one step further:
At this point, the compiler can work a little magic. It can omit creating the temporary object that holds the function return value, because that object is unnamed. It can construct the object defined by the return expression inside the memory of the object that is receiving the result.
Return value optimization is another way of saying return constructor arguments instead of named objects. What kind of gains can you expect from this optimization? Here are some trivial example functions I used to evaluate potential performance improvements.
The results are favorable. String performance doubled and bitmap performance more than tripled. Before you go using this trick willy-nilly, be aware of the drawbacks. It's hard to check for errors in intermediate results. In the first version of Mult above, we could easily add error checking on the real and imaginary values. In the final version, all we can do is hope there's no overflow or underflow errors. The first version is also much easier to read. In the final version, there could be a non-obvious error. For instance, is the real part the first parameter to the complex constructor, or is the imaginary part?
One more note. The final version of the C++ standard has made it easier for compilers to optimize away even named objects. The standard says that for functions with a class return type, if the return statement is the name of a local object, and the object is the same type as the return type, the compiler can omit creating a temporary object to hold the function return value. That means that in some rosy future years away, return value optimization will be in the hands of compiler writers where it should be, and we won't have to change our code. And pigs could fly. In the meantime, this tip is worth considering.
4.3: Be Aware of the Cost of Virtual Functions
Much has been made about the overhead of virtual functions. A single virtual function in a class requires that every class object and derived objects contain an extra pointer. For simple classes, this can as much as double the size of the object. Creating virtual objects costs more than creating non-virtual objects, because the virtual function table must be initialized. And it takes slightly longer to call virtual functions, because of the additional level of indirection.
However, when you really need virtual functions, you're unlikely to develop a faster mechanism than the virtual function dispatching built into C++. Think about it. If you needed to track an object's type, you'd have to store a type flag in the class, adding to the size of the object. To duplicate virtual functions, you'd probably have a switch statement on the flag, and the virtual function table is faster - not to mention less error prone and much more flexible - than a switch statement.
The following code shows a worst case scenario, a tiny object with trivial functions. The Virtual object is 8 bytes, and the NonVirtual object is 4 bytes.
Construction/destruction time shows the performance penalty of initializing the virtual function table. (See the Microsoft-specifictip for reducing this penalty). Notice that the function call overhead is very minimal, even though the function itself hardly does anything. For larger functions, the overhead becomes even less significant.
4.4: Return Objects Via Reference Parameters
The cleanest way to return an object from a function is returning the object by value.
When you're writing a function, the above method is preferable about 99.9% of the time. You should almost never return an object by reference. That's why this tip is entitled return objects via reference parameters, not return object by reference.
When t goes out of scope, it's destroyed and its memory is returned to the stack. Any reference to the object is invalid after the function has returned! A good compiler will warn you about this type of mistake. You can, however, pass your return value as a non-const reference function parameter, and in some cases see improved performance.
You'll only see a performance improvement if the object you're passing in as the desired return value (byRef) is being reused and hasn't been simply declared immediately prior to calling in the function. In other words,
may be faster, becauset is being reused every time through the loop, and doesn't have to be reconstructed and redestroyed at each iteration, while
is exactly the same as returning the object by value, and even worse, doesn't lend itself to possible return value optimizations. There's another good reason to avoid this suggestion - it can make your code very hard to read.
The results above show the times spent within ByValue and ByReference. These numbers are slightly misleading, because they don't show any construction time for T in the ByReference case, and T certainly must be constructed somewhere prior to calling ByReference. Nevertheless, the results show that performance gains may be significant in limited cases.
4.5: Consider Per-class Allocation
One feature of C++ that shows the true power and flexibility of the language is the ability to overload new and delete at the class level. For some objects, this power can give you incredible speed improvements. To see what kind of performance improvements I could get, I derived my own string class from the standard string class and tried various allocation schemes. I wasn't too successful, namely because standard new and delete are already pretty dang fast, and because string class wasn't a really good choice. The objects that will see the most improvement are the objects you use in a specific way that can truly benefit from a custom allocation scheme.
I used an approach called the "memory pool" method. Using individual pools for memory allocation is beneficial because it improves locality and you can optimize knowing that all objects in the pool are the same size.
My new string class looked like this:
I tried some different types of pools. MemPool uses a heap block and a free list. StackPool uses a chunk of the stack for the pool. HeapPool uses one heap block and no free list.
The StackPool object is shown below.
The stack pool gave the best performance. It was faster than the default new and delete implementations by 20%. It was sorely limited, though. After a certain number of allocations, it would crash because there was no more room in the pool. StackPool isn't complex enough to say "OK everybody - out of the pool." Profile your code. If you find that new or delete is a bottleneck for certain specific objects, consider overloading new and delete. However, you might find your money is better invested in a memory management library that will improve your performance across the board.
4.6: Consider STL Allocators
STL containers provide a method similar to overloading new and delete that allow you to customize the allocation behavior of a container. Every container takes as a template parameter an "allocator" object. This template parameter defaults to the standard allocator provided by the STL implementation. In most cases, the standard allocator class uses new and delete internally. You can always write your own allocators if you think you can do a better job, or, more likely, you have a container that you use in such a way that lends itself to custom allocation.
The following class is a custom allocator object called MyAlloc. It is based on a standard allocator object and can be used in place of a standard allocator.
The key functions are allocate and deallocate, which perform the bulk of the work. If you compare these functions to the overloaded new and delete operators in the previous example, you'll see that they're very similar. Here's an example of using MyAlloc as the allocator for a list container. I use the HeapPool object mentioned above as the pool for this allocator.
To compare efficiency, I evaluated the speed of inserting items into a list using the default allocator and NHeapPoolAlloc.
The results are all over the place. Strangely enough, there was significant improvement for ints. On the other hand, string performance plummeted. Just goes to show how important it is to evaluate your "optimization" once it's in place.
4.7: The Empty Member Optimization
Most programmers are surprised to discover that empty classes still require memory. For instance, suppose we define an allocator-type class similar to STL allocators. This class doesn't contain any data - just a couple of functions.
If we declare this object, it requires one byte of memory (sizeof(Allocator) == 1), because the C++ standard requires that we be able to address its location in memory. If we declare this object within another class, the compiler byte-alignment settings come into play. The class below requires 8 bytes of memory if we're 4-byte aligned.
This storage requirement has serious ramifications. This list object requires twice the size it really needs. Fortunately, the C++ standard provides a workaround. It says that "a base class subobject of an empty class type may have zero size." In other words, if we derive our class from the empty class, the empty class overhead disappears. This is the empty member optimization. The class below is 4 bytes.
Deriving from the empty class is not really an ideal solution. There are some cases when it's no solution at all. Here's a better one. We can declare an internal data member derived from the empty class.
Now there's an additional level of indirection within the class itself (i.e. we have to use m_head.allocate notation instead of allocate()), but our list is still only 4 bytes large and we have all the advantages of the allocator object. A Watcom engineer reported that STL benchmarks ran 30% faster after their compiler team implemented the empty-base optimization.
4.8: Template Metaprogramming
Recently there's been a growing awareness of an unforeseen template capability - using templates as pre-compilers. By embedding a call to a template within itself, a programmer can force a compiler to recursively generate code. A good compiler can optimize away all the recursion, because all template parameters must be known at compile time. The result? Massive speed improvements. Here's a function which recursively generates the factorial of n.
Now look at a template-based version that does the same thing where n is passed as the template parameter.
And some example calls, including a call to a non-recursive version.
For this particular case, the template version is 130 times as fast as the non-template version. The Microsoft compiler optimizes away all of the template recursion, implementing the call as a single move instruction. Now that's the kind of optimization I like to see.
Nice results, but not a terribly useful function. The template parameter must be known at compile time, so it's not very flexible. For some things, though, that's not really a limitation. Suppose you wanted to have an inline sine table. You could use template metaprogramming to create a class that computed the sine of a number using a series expansion. I did just that. The code is complex, so I leave it as an exercise for the reader. (Actually, it's in the accompanying TemplateMetaprogramming source file). I compared my template-based sine to the C runtime version, a table-based function, and a non-template function that used series expansion.
In this case, the template-based version performed miserably, when it should have been roughly the same speed as the table-based version. It turned out that the compiler had not optimized the template-based series expansion at all, but had implemented it as a massive set of embedded function calls. In theory, it should have done the same thing as the factorial example and generated a single move instruction. In practice, it punted when confronted with too much complexity. Consider yourself warned. Compilers don't always do what you want. Your only friend is a profiler.
One more example. Microsoft provides a set of 3D matrix functions with its Direct3D dev kit in d3dutils.cpp. One of those functions performs matrix multiplication. It looks like this:
We can replace this function with a template-based version that unrolls the loops, completely eliminating the loop overhead:
The template-based version is called like this:
The template could be more flexible by providing a dimension parameter, but I left that out for simplicity's sake. The template function calculates the next values of i, j, and k and recursively calls itself with a new count, which goes from 0 to 64. To terminate the loop, there's a template specialization that just returns. With a good compiler, the code generated should be as efficient as writing MatrixMult like this:
Unfortunately, Microsoft's compiler wasn't completely up to the task, although we did see a minor improvement over the existing version. Currently, the best performance gain comes from rewriting the entire function with all loops unrolled.
Template metaprogramming has its advantages. It can be extremely effective for mathematical and scientific libraries, and there's definite possibilities for streamlining 3D math. One public scientific computing library, calledBlitz++, is based on the template-metaprogramming concept. The performance of the library is on par with the best Fortran libraries at maximum optimization. In other cases, performance increases of 3-20 times that of a commercial C++ linear algebra library have been achieved. However, compiler support for templates is still immature. Microsoft in particular was slow to implement template functionality, and even VC 6.0 doesn't support the full C++ standard for templates. As compilers advance, template metaprogramming may take a larger place in optimization technology.
One general method of increasing efficiency is called lazy evaluation. With lazy evaluation, nothing is precomputed; you put off all your processing until the result is really needed. A complementary method is called "copy-on-write." With copy-on-write, two or more objects can share the same data until the moment when one of those objects is changed, at which point the data is physically copied and changed in one of the objects. C++ lends itself nicely to copy-on-write, since it can be added without affecting a class interface. Microsoft was able to change the internals of its own CString class to add copy-on-write functionality. Most programmers never noticed the difference because the class interface was unchanged.
Copy-on-write requires two things: reference counting and smart pointers. A reference count indicates the number of objects referring to the same piece of data. A smart pointer points to an object with a reference count. When the reference count goes to zero, the smart pointer is "smart" enough to automatically delete the object. A simple RefCount and SmartPtr class follow. Note that DownRefCount returns true when the reference count goes to zero and it's safe to delete the object.
A SmartPtr acts just like a regular pointer except in two cases: when it's copied, the reference count is incremented, and when it's destroyed the reference count is decremented. If the reference count goes to zero, the object pointed to is destroyed as well.
We create a new reference-counted bitmap class by inheriting from Bitmap and using the RefCount mixin class.
Now we can create a "smart" bitmap class. It contains a smart pointer to a reference-counted bitmap. Whereas copying a regular bitmap required a deleting and reallocating memory, copying a SmartBitmap is as simple and efficient as copying a pointer. We only need to do "expensive" operations when the bitmap is actually changed (e.g. Create'd or Destroy'ed).
The disadvantage is that we've added another level of indirection through the smart pointer. We've also required that the object constructor allocate the reference-counted object so that the smart pointer can properly delete it. Another important note: the destructor doesn't do anything! That's because the smart pointer will automatically delete the object when nobody else is referring to it. Cool.
To evaluate performance, I compared a large set of typical bitmap operations, including copying. SmartBitmap was six times as fast as the original "dumb" bitmap. Your results will vary depending on how much objects are copied (slow for dumb objects, fast for smart objects) and how much copied objects are changed (fast for dumb objects, slow for smart objects).
The nice thing about smart objects is that they can be returned by value with no performance penalty. In the case above, it's now feasible to use SmartBitmap as a return value.
Back to C++ Optimization Techniques