Abstract: A substantial portion of a knowledge worker's life may be spent waiting for a computer program to produce output. Users and organizations control their wait time by purchasing faster computers, adding memory, or using faster network connections. Developers of application programs have a responsibility to design their programs make the best use of these limited and expensive resources.
This document describes techniques for optimizing (improving the speed of) computer programs written in C. It focuses on minimizing time spent by the CPU and gives sample source code transformations that often yield improvements. Memory and I/O speed improvements are also discussed.
The single most effective optimization technique is to use a profiler to identify performance bottlenecks. It's often difficult to guess what part of your program is consuming the most resources, and if you base your optimization efforts on speculation instead of real data, you'll waste a lot of time speeding up the parts of your program that were fast already.
Once you have identified a bottleneck, for example a loop that is executed thousands of times, remember that the best thing to do is to redesign the program so that it doesn't need to execute the loop thousands of times. This is more effective than making the loop run 10% faster, yet just as often, which is precisely the sort of thing an optimizing compiler can do by itself. Optimization is simply waste of programmer time if any of these statements are true:
Also take into account how the program will be used. If it's a report-generator program which only needs to be run once a day, the user may start it before going off to lunch and if so there's really no point in making it finish before they get back. If it's being invoked from within another program that's even slower than yours, again the user will notice no difference. But if it handles mouse-tracking events for a GUI the user will complain about any noticeable delay.
Given that optimizing is reasonable, compile in full optimize mode and run your program on "real-world" input data. If you don't have access to real input data, choose your test input data with care: programmers tend to test with a smaller set of input data and a different variety of cases than the the program will likely encounter once it's in the hands of users.
Use the time(1) command to see if your program is compute-bound, memory
bound, or I/O bound, or for that matter if it's bound at all: even if your
program "seems" to be taking a long time to run, it may only be a second
of CPU time. You may have more "time" commands on your system than you
realize: sometimes it's built into the shell (SunOS csh for example) and
has lots of nifty options. You can also get performance information from
getrusage() if you have it, and of course from profiling programs like
gprof, prof, and tcov.
Recompile your program with profiling enabled and whatever optimization options are compatible with that. Run your program, again on real world data, and generate a profiling report. Figure out which function uses the most CPU time, then look it over very carefully and see if any of these approaches might be useful. Make one change at a time, then run the profiler again, repeating the process until there is no obvious bottleneck or the program runs sufficiently fast.
The first four are quite important; the rest are in no particular order.
Think about what the code is really doing. Become familiar with the body of literature that describes your specialty and learn and use the most appropriate algorithms, even if you didn't come up with them yourself. You should be familiar with O(n) notation, which is defined in many computer science texts.
Some of the obvious replacements:
slow algorithm replace with sequential search binary search or hash lookup insertion or bubble sort quick sort, merge sort, radix sort
Also choose an appropriate data structure. If you'll be doing a lot of insertions and deletions at random places then a linked list would be good. If you'll be doing some binary searching, an array would be better.
Some of the very things that make code clear and readable to humans also make it clear and readable to compilers. Complicated expressions are harder to optimize and can cause the compiler to "fallback" to a less intense mode of optimization. I've seen one compiler that has translation-unit limits which when overrun will cause the entire module to be de-optimized by one level.
Part of the clarity is making hunks of code into functions when appropriate. The cost of a function call is extremely small on modern machines, so optimization is NOT a valid excuse for writing ten-page functions.
If you write clean, portable code you can quickly port to the latest, fastest machine and offer that as a solution to your customers who are interested in speed.
Get a sense for how long certain operations take. Among the the slowest are opening a file, reading or writing significant amounts of data, starting a new process, searching, sorting, operations on entire arrays, and copying large amounts of data around. The fastest operations are basic elements of the language like assigning to a variable, dereferencing a pointer, or adding two integers. None of these operations take very long in and of themselves (a few microseconds) but all computer languages allow sections of code to be executed repeatedly. If you perform even the fastest operation 10 million times, it will take a noticeable amount of time. In real programs, various things happen various numbers of times and having some common sense can help you interpret the output from your profiler.
A sure sign of misunderstanding is this fragment:
if (x != 0) x = 0;
The intent is to save time by not initializing x
if it's already
zero. In reality, the test to see whether it's zero or not will take
up about as much time as setting it to zero itself would have.
x = 0;
has the same effect and will be somewhat faster.
For the intrepid hacker, there is no substitute for examining the assembler-level output the compiler generates and counting the instructions. If you do this, don't forget to include wait states in your tally. Also keep in mind that some optimization and instruction scheduling is put off until link time and won't show up in the assembler output for an individual module.
I have a small, fairly portable program that prints out the
approximate unoptimized speed of basic C operators and
library functions, speed.c . It's not
a benchmark (the measurements are not weighted or averaged in
anyway) but it'll give a beginner a numerical sense to the
cost of various things in C.
Note: (a) I have no anonymous ftp server
at my site, so you'll have to look at it in your browser and save
it to a file or copy and paste it into an editor.
(b) Be sure to compile it with optimization off.
Most of the code is trivial and would be eliminated by an optimizer;
the timing would end up being mostly zeroes.
Most compilers have several levels of optimization. Make sure you're
using the highest level. One popular compiler, gcc, has an incredible
variety of optimization levels and options. Some compilers have
special #pragmas
or keywords (for example, inline
)
which also affect optimization.
High levels of optimization can make poorly written (but compilable) code break. Less often, optimizers mangle perfectly correct code. Problems with side effects and evaluation order tend to screw things up, so you may have to fix the code up a little before you can use the highest level. Signal handlers activated at inopportune times may disturb an optimized program in ways that wouldn't disturb an unoptimized one.
gcc (with the -finline-functions
option) and a few other
compilers are capable of
inlining small functions at higher optimization levels automatically.
K&R C compilers that I've seen won't do it at all or will only do it
for library functions written in assembly (for example the math
library). C++ compilers almost universally support inline
functions.
If necessary, C functions can be recoded as macros to obtain similar speedup on compilers with no inlining capability. This should be done after the code is completely debugged. All debuggers I've seen are incapable of displaying or stepping through the expanded macro text.
An example of inlining via macros:
Old code:
int foo(a, b) { a = a - b; b++; a = a * b; return a; }
New code:
#define foo(a, b) (((a)-(b)) * ((b)+1))
The extra parentheses are necessary to preserve grouping in case
foo
is used in an expression with higher precedence than
*
or in case a
and/or b
contain
subexpressions with lower precedence than +
or -
.
Comma expressions and
do { ... } while(0)
can be used for more
complicated functions, with some restrictions. The
do-while
macros
let you create local variables for use in the macro, but you can't
return
a value to the expression the macro is used in.
The opposite is true for macros using comma expressions.
Some caveats:
Many compilers (e.g. gcc -funroll-loops
) will do this,
but if you know that yours doesn't you can change your source code
a bit to get the same effect.
Old code:
for (i = 0; i < 100; i++) { do_stuff(i); }
New code:
for (i = 0; i < 100; ) { do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; do_stuff(i); i++; }
This way the test for i < 100
and the branch back up to the top of
the loop only get executed 11 times rather than 101. Loop unrolling
works best when the loop is executed a fixed non-prime number of times
and the iteration variable is only modified in one place (aside from
its initialization). If do_stuff()
didn't make use of
i
, all the
little i++
's could be replaced by a single
i += 10
. Re-arranging the
for
loop into a
do-while
loop can make the 11 into 10. If the loop
only went to, say, five rather than 100 you could unroll the loop
completely and eliminate the branching and testing entirely.
For computations which converge on some result, compilers will often refuse to unroll on the grounds that unrolling will change the result. If the application is not sensitive to excess precision you can arrange to test the convergence less often by duplicating the interior of the loop. This is especially useful if the test for convergence is expensive in relation to the calculation of the result.
An unrolled loop is larger than the "rolled" version and so may no
longer fit into the instruction cache (on machines which have them).
This will make the unrolled version slower. Also, in this example,
the call to do_stuff()
overshadows the cost of the loop, so any
savings from loop unrolling are insignificant in comparison to what
you'd achieve from inlining in this case.
If you happen to be using a vectorizing compiler, loop unrolling can interfere with the vector optimizations.
The idea is to combine adjacent loops which loop over the same range
of the same variable. Assuming nothing in the second loop indexes
forward (for example array[i+3]
), you can do this:
Old code:
for (i = 0; i < MAX; i++) /* initialize 2d array to 0's */ for (j = 0; j < MAX; j++) a[i][j] = 0.0; for (i = 0; i < MAX; i++) /* put 1's along the diagonal */ a[i][i] = 1.0;
New code:
for (i = 0; i < MAX; i++) { for (j = 0; j < MAX; j++) a[i][j] = 0.0; /* initialize 2d array to 0's */ a[i][i] = 1.0; /* put 1's along the diagonal */ }
And now the incrementing and testing of i
is done only
half as often. Under some circumstances, locality of reference may be
better, improving cache behavior. (The example above was lifted
from Aho & Ullman.)
Some machines have a special instruction for decrement and compare with 0. Assuming the loop is insensitive to direction, try this replacment:
Old code:
for (i = 1; i <= MAX; i++) { ... }
New code:
i = MAX+1; while (--i) { ... }
Though watch out for problems with lookahead caches as noted below in MARCH FORWARD. If you plan on doing this in combination with pointer arithmetic note that while ANSI C has a special rule that allows you to set an index to one element past the end of the array, there is no similar rule for one element before the beginning of the array.
Strength reduction is the replacement of an expression by a different expression that yields the same value but is cheaper to compute. Many compilers will do this for you automatically. The classic examples:
Old code:
x = w % 8; y = pow(x, 2.0); z = y * 33; for (i = 0; i < MAX; i++) { h = 14 * i; printf("%d", h); }
New code:
x = w & 7; /* bit-and cheaper than remainder */ y = x * x; /* mult is cheaper than power-of */ z = (y << 5) + y; /* shift & add cheaper than mult */ for (i = h = 0; i < MAX; i++) { printf("%d", h); h += 14; /* addition cheaper than mult */ }
Also note that array indexing in C is basically a multiply and an add. The multiply part can be subjected to strength reduction under some circumstances, notably when looping through an array.
Any part of a computation that does not depend on the loop variable and which is not subject to side effects can be moved out of the loop entirely. Most compilers are pretty good at doing this. Try to keep the computations within the loop simple anyway, and be prepared to move invariant computations out yourself: there may be some situations where you know the value won't vary, but the compiler is playing it safe in case of side-effects. "Computation" here doesn't mean just arithmetic; array indexing, pointer dereferencing, and calls to pure functions are all possible candidates for moving out of the loop.
In loops which call other functions, you might be able to get some
speedup by ripping the subroutines apart and figuring out which parts
of them are loop-invariant for that particular loop in their caller
and calling those parts ahead of time. This is not very easy and
seldom leads to much improvement unless you're calling subroutines
which open and close files repeatedly or malloc
and free
large amounts
of memory on each call or something else drastic.
A common but not-always-optimized-away case is the repeated use of an expression in successive statements. This doesn't have to be related to a loop at all. You can try this replacement and see if it helps:
Old code:
total = a->b->c[4]->aardvark + a->b->c[4]->baboon + a->b->c[4]->cheetah + a->b->c[4]->dog;
New code:
struct animals * temp = a->b->c[4]; total = temp->aardvark + temp->baboon + temp->cheetah + temp->dog;
Older C compilers were allowed to regroup arithmetic expressions to some extent. ANSI codified that arithmetic expressions are evaluated as they are grouped so as to avoid any unwanted surprises. What this means is that
float a, b, c, d, f, g; ... a = b / c * d; f = b * g / c;
will not been seen as having the common subexpression
b / c
. If you
rewrite the second expression:
float a, b, c, d, f, g; ... a = b / c * d; f = b / c * g;
an ANSI C compiler is now free to compute
b / c
only once. Note that
the new code may behave differently for overflow or provide slightly
different results with floating point.
In a section of code which deals with several alternative situations,
place at the beginning the tests and the code for the situations which
occur most often. Frequently, this takes the form of a long train of
mutually exclusive if-then-else
's, of which only one will get
executed. By placing the most likely one first, fewer
if
's will need
to be performed over the long term.
But if the conditions are simple things like x == 3
, consider using
a switch
statement. Some compilers are quite sophisticated in how
they translate switch
statements.
First, let me define tail recursion elimination (TRE). When a recursive function calls itself, an optimizer can, under some conditions, replace the call with an assembly level equivalent of a "goto" back to the top of the function. The saves the effort of growing the stack, saving and restoring registers, and any other function call overhead. For very small recursive functions that make zillions of recursive calls, TRE can result in a substantial speedup. With proper design, the TRE can take a recursive function and turn it into whatever is the fastest form of loop for the machine.
Tail recursion elimination (TRE for short) has been around for a long time. It originated with "functional" languages like LISP which do so much recursion that TRE is a necessity. C, C++, and the "pascal-like" languages fall into the "imperative" language category and extremely efficient programs, recursive or not, can be written and compiled without the benefit of TRE and still perform on par with (and often better than) a similar LISP program. What I'm leading up to is that TRE is not automatically present in every modern optimizer, though many certainly do have it.
Back to the conditions I mentioned earlier. In order for TRE to be a safe optimization the function must return the value of the recursive call without any further computation. Example:
int isemptystr(char * str) { if (*str == '\0') return 1; else if (! isspace(*str)) return 0; else return isemptystr(++str); }
The above can have TRE applied to the final return statement because the returned value from this invocation of isemptystr will be exactly that of the n+1th invocation, with no further computation.
And now a counterexample:
int factorial(int num) { if (num == 0) return 1; else return num * factorial(num - 1); }
The above cannot have TRE applied because the returned value is not used directly: it is multiplied by num after the call, so the state of that invocation must be maintained until after the return. Even a compiler that supports TRE cannot use it here.
And now a counter-counterexample, a rewrite of the factorial program to allow TRE optimization.
int factorial(int num, int factor) { if (num == 0) return factor; else return factorial(num - 1, factor * num); }
In my experience, the optimizers for imperative languages don't bother to perform this sort of rewriting. I have seen it done with Scheme compilers, so it is possible. Programmers who write code this way (other than for example purposes!) should be beaten about the head and shoulders with a blunt dandruff-removing object.
Even if your compiler implements TRE optimization, you should not assume that it automatically has done so for you. You may have to rewrite even the simplest recursive function before TRE can be applied. If doing so reduces the readability of the function, the effort is highly questionable.
There is a large subset of recursive algorithms that have simple iterative counterparts. C compilers are very, very good at optimizing loops and can do so without the sometimes-onerous conditions that TRE requires, so if you must optimize at the source level, consider using plain iteration before TRE-friendly rewriting.
Finally, if the recursive function contains loops or large amounts of code, TRE will be only marginally helpful, since TRE only optimizes the recursion, not anything about the algorithm itself. Function calls are very fast, and optimizing out something that is already very fast will quickly run into the law of diminishing returns.
Consider using lookup tables especially if a computation is iterative or recursive, e.g. convergent series or factorial. (Calculations that take constant time can often be recomputed faster than they can be retrieved from memory and so do not always benefit from table lookup.)
Old code:
long factorial(int i) { if (i == 0) return 1; else return i * factorial(i - 1); }
New code:
static long factorial_table[] = {1, 1, 2, 6, 24, 120, 720 /* etc */}; long factorial(int i) { return factorial_table[i]; }
If the table is too large to type, you can have some initialization
code compute all the values on startup or have a second program
generate the table and printf
it out in a
form suitable for #include
-ing
into a source file. At some point, the size of the table will have a
noticeable affect on paging space. You could have the table cover the
first N cases then augment this with a function to compute the rest.
As long as a significant number of the values requested are in the
table there will be a net win.
For nearly all situations, the library qsort
function is
speedy enough to make implementation of your own sort algorithm
unnecessary. Focus your optimization efforts on speeding up the
comparison function. Often the
strcmp
optimizations discussed later in this document
are helpful.
Consider massaging the input data ahead of time in such a way as to make the comparison routine run faster:
If you are sorting array of structs or other large objects, sort an array of pointers to them instead so the sort function won't waste time copying whole structs around during the sort.
Evaluate ratio of insertions to lookups.
If you are writing the sort function yourself, write a specific sort for a specific type of array and a specific comparison function. Incorporate everything into your sort function, essentially inlining the comparison function and swapping code.
For custom sorts patterned after qsort you can make special copies
of the sort which assume that data being swapped is a convenient machine
size, like
short
,
long
, or
double
.
When the width of the object
being sorted matches the size of a basic type (and the alignment is
suitable) you can do a direct swap instead of calling memcpy
.
Make sure assert
or other debugging statements are
commented out or #defined away.
Avoid referring to global or static variables inside the tightest
loops. Don't use the volatile
qualifier unless you really mean it.
Most compilers take it to mean roughly the opposite of
register
, and
will deliberately not optimize expressions involving the volatile
variable.
Avoid passing addresses of your variables to other functions. The optimizer has to assume that the called function is capable of stashing a pointer to this variable somewhere and so the variable could get modified as a side effect of calling what seems like a totally unrelated function. At less intense levels of optimization, the optimizer may even assume that a signal handler could modify the variable at any time. These cases interfere with placing variables in registers, which is very important to opimization. Example:
a = b(); c(&d);
Because
d
has had its address passed to another function, the compiler
can no longer leave it in a register across function calls. It can
however leave the variable
a
in a register. The register
keyword can be used to
track down problems like this; if d
had been declared register the
compiler would have to warn that its address had been taken.
While functions and modularity are a good thing, a function call inside an oft-executed loop is a possible bottleneck. There are several facets to the problem, some of which I've alluded to above. Aside from the expense of executing the instructions in the other function:
Straight-line code, even with an extra statement or two, will run
faster than code full of
if
's,
&&
's,
switch
's, and
goto
's. Pipelining
processors are much happier with a steady diet of sequential
instructions than a bunch of branches, even if the branches skip some
unnecessary sections.
Most of the C library str*
and mem*
functions operate in time
proportional to the length(s) of the string(s) they are given. It's
quite easy to loop over calls to these and wind up with a significant
bottleneck. Some things that may ease the situation:
strlen
Avoid calling strlen()
during a loop involving the string itself.
Even if you're modifying the string, it should be possible to
rewrite it so that you set x = strlen()
before the loop and then
x++
or x--
when you add or remove a character.
strcat
When building up a large string in memory using strcat
, it will
scan the full (current) length of the string on each call. If
you've been keeping track of the length anyway (see above) you can
index directly to the end of the string and
strcpy
or memcpy
to there.
strcmp
You can save a little time by checking the first characters of
the strings in question before doing the call. Obviously, if the
first characters differ, there's no reason to call
strcmp
to check
the rest. Because of the non-uniform distribution of letters in
natural languages, the payoff is not 26:1 but more like 15:1 for
uppercase data.
#define QUICKIE_STRCMP(a, b) (*(a) != *(b) ? \ (int) ((unsigned char) *(a) - \ (unsigned char) *(b)) : \ strcmp((a), (b)))
But watch out:
strcmp
aren't just simple
variables.
An entirely different way to speed up strcmp
is to place all your
strings into a single array, in order. Then you only have to
compare the pointers, not the strings. If the point of all the
calls to strcmp
is to search for a value from a large, known set
and you expect that you'll be doing many such searches then you'll
want to invest in a hash table.
strlen(s) == 0
Empty string tests are common, and even experienced programmers
may use strlen for this purpose. But if the strings you are
working with are of substantial size, strlen
will dutifully check
each character until it finds the NUL terminator. Replace with
*s == '\0'
which saves the function call overhead plus it only has to check
the first character. There can be a substantial improvement if
your character strings are fairly long.
strcpy(s, "")
*s = '\0'
works just fine and saves a function call.
strncpy(s, t, n)
Be aware that strncpy
pads short strings with 0's. Except for the
first one, which terminates the string, the extra zeroes are
typically never used, so some time is wasted, though typically not
very much.
memcpy/memmove
Generally, memcpy
is faster than memmove
because
it can assume
that its arguments don't overlap. If you are tempted to replace
these with your own versions, be sure to verify that your version
(a) works and (b) is faster.
In some really weird cases (early RISC chips with FPUs, which includes many SPARC-based machines) you can get some speed by turning your integer multiplies, divides, and modulos into float operations. This does two things: the FPU is taking some of the load off the main CPU so it can get some other stuff done; but mostly some RISC machines emulate integer multiply and divide in software and the FPU can actually do it faster anyway. This is not a sure thing and you should try it both ways on all the machines you plan to run on before you do this. Also, converting back-and-forth between ints and floats can take up considerable time. I have heard this is an acute problem on Apollos.
On many machines,
float
s work faster than double
s. If the bottleneck
involves FP arithmetic and you don't need the extra precision, change
the pertinent variables to float
and see what happens. But similar to
above, the cost of converting between
float
s and
double
s can outweigh
the benefits if applied indiscriminately. Note that on K&R compilers
and ANSI compilers used w/o prototypes, there is an automatic
conversion of
float
s to
double
s in function calls and in many
expressions.
gcc produces somewhat faster code than the compilers that come with some OS's. Try compiling your code with all the compilers you have at your disposal and use the fastest one for compiling the one or two functions which are bottlenecks. For the rest just use the one that gives the most informative error messages and produces the most reliable output. This is more difficult with C++ since different compilers can use incompatible name-mangling schemes.
Compilers are evolving even as we speak, so keep up with the latest revision if possible.
A typical cause of stack-related problems is having large arrays as local variables. In that case the solution is to rewrite the code so it can use a static or global array, or perhaps allocate it from the heap. A similar solution applies to functions which have large structs as locals or parameters. (On machines with small stacks, a stack bound program may not just run slow, it might stop entirely when it runs out of stack space.)
Recursive functions, even ones which have few and small local variables and parameters, can still affect performance. On some ancient machines there is a substantial amount of overhead associated with function calls, and turning a recursive algorithm into an iterative one can save some time.
A related issue is last-call optimization. Many LISP and Scheme compilers do this automatically, but few C compilers support it. Consider this example:
int func1() { int a, b, c, etc; do_stuff(a, b, c) if (some_condition) return func2(); else return 1; }
Since func1()
's last statement is to call
func2()
and func1()
has no need of its
variables after that point, it can remove its stack frame. When
func2()
is done it returns directly to
func1()
's caller. This (a) reduces the maximum depth of
the stack and (b) saves the execution of some return-from-subroutine
code as it will get executed only once instead of twice (or more,
depending on how deeply the function results are passed along.)
If
func1()
and
func2()
are the same function (recursion) the compiler
can do something else: the stack can be left in place and the call can
be replaced by a
goto
back to the top of the function. This is called
tail-recursion elimination.
Estimates vary widely, but a competent human writing assembly-level code can produce code which runs about 10% faster than what a compiler with full optimization on would produce from well-written high-level source. Of course, this is not a portable solution. RISC assembler is especially difficult to write by hand.
Experts vary on the approach one should take. Some suggest writing your own assembler version of a function from scratch in hopes that you'll find a novel way of calculating the result while others recommend taking the compiler's version as a starting point and just fine-tuning it.
Dynamically linked shared libraries are a really nifty thing. In many cases though, calling a dynamically linked function is slightly slower than it would be to call it statically. The principal part of this extra cost is a one-time thing: the first time a dynamically called function is called there is a bit of searching going on, but thereafter the overhead should be a very tiny number of instructions.
For applications with thousands of functions, there can be a noticeable lag at startup. Linking statically will reduce this, but defeats (to some extent) the benefits of code sharing that dynamic libraries can bring. Often, you can selectively link some libraries as dynamic and others as static. For example, the X11, C and math libraries you'd link dynamically (since other processes will be using these also and the program can use to the copy already in memory) but still link your own application-specific libraries statically.
As with other machine-specific code, you can use #ifdef to set off
sections of code which are optimized for a particular machine.
Compilers don't predefine RISC
or SLOW_DISK_IO
or
HAS_VM
or VECTORIZING
so you'll have to come
up with your own and encode them into a makefile or header file.
The foremost consideration when optimizing memory access, whether for
virtual memory or cache considerations, is locality of
reference. That is the property of a program to use addresses
which are near (in both time and location) other recent references to
memory. The main difference between optimizing for VM and optimizing
for cache is scale: VM pages can be anywhere from 0.5kb to 8kb and
beyond and can take tens of milliseconds to read in from disk. Cache
blocks typically range from 16 bytes to 256 bytes and get read in in
the tens of microseconds. A program which forces many VM pages or
cache lines to load in quick succession is said to be "thrashing."
You can affect locality of reference by changing the order in which
you access and allocate things and by splitting your data structures
into "frequently used" and "infrequently used" segments and
allocating all the "frequently used" stuff together. (But don't fool
yourself into thinking that
Search and sort algorithms differ widely in their patterns of
accessing memory. Merge sort is often considered to have the best
locality of reference. Search algorithms may take into
consideration that the last few steps of the search are likely
to take place in the same page of memory, and select a different
algorithm at that point.
When stepping through multi-dimensional arrays, be sure to increment
the rightmost index first. This natural for C programmers, but
backwards for FORTRAN. If you find yourself writing C code like this,
you probably need to be re-educated:
Instead of copying strings, arrays, or large structs, consider copying
a pointer to them. As long as you're done using the pointer before
you modify the string, array, or struct you're okay.
ANSI C now requires that structs are pass-by-value like everything
else, thus if you have extraordinarily large structs, or are making
millions of function calls on medium-sized ones, you might consider
passing the struct's address instead, after modifying the called
function so that it doesn't perturb the contents of the struct.
If the parts of your program making heaviest use of memory are
doing so by accessing elements in "parallel" arrays you can
combine them into an array of structs so that the data for
a given index is kept together in memory.
If you already have an array of structs, but find that the
critical part of your program is accessing only a small
number of fields in each struct, you can split these
fields into a separate array so that the unused fields
do not get read into the cache unnecessarily.
On machines with alignment restrictions you can sometimes save a tiny
amount of space by arranging similarly-typed fields together in a
structure with the most restrictively aligned types first. There may
still be padding at the end, but eliminating that can destroy the
performance gains the alignment was meant to provide.
ANSI C makes few guarantees about internal struct padding, but an
arrangement like that usually results in minimal losses even on the
pickiest of RISC machines.
A typical use of
Paradoxically, increasing the size and alignment of a data structure
to match (or to be an integer fraction or multiple of) the cache line
size may increase performance. The rationale is that if the data structure
is an odd size relative to the cache line size, it may overlap two cache
lines thus doubling the time needed to read it in from main memory.
The size of a data structure can be increased by adding a dummy
field onto the end, usually a character array. The alignment
is harder to control, but usually one of these techniques will work:
Theoretically, it makes no difference whether you iterate over an
array forwards or backwards, but some caches are of a
"predictive" type that tries to read in successive cache lines even
before you need them. Because these caches must work quickly, they
tend to be fairly dim and rarely have the extra logic for predicting
backwards traversal of memory pages.
One common method of mapping memory addresses to cache lines is to
strip off leading bits from the address.
Take as an example a hypoythetical direct-mapped 1MB cache with
128-byte cache lines and a program which uses 16MB of memory, all
happening on a machine with 32 bit addresses. The simplest way for the
cache to map the memory into the cache is to mask off the first 12 and
the last 7 bits of the address, then shift to the right 7 bits. What
we end up with is a cache that maps any two addresses exactly 8192
(2^13) bytes apart in main memory to the same cache line. If the
program happens to use an array of 8192 byte structs, and refers to
just one element in each one while processing the whole array, every access
will map to the same cache line and force a reload, which is
considerable delay.
This seems like a contrived situation, but the same problem arises for
any struct which is a multiple of 8192 (in the above example). If the
struct were half the size, the problem would be half as severe (and so
on) but probably still noticeable. Since we have used only one (one!)
cache line, any future access to the array will have to reload cache
lines, in spite of the supposed 1MB capacity.
The solution in this case is probably to isolate the one field into a
separate array. Say the field is 4 bytes wide; then the cache would
only need a reload every 32 iterations through the array. This is
probably seldom enough that the cache lines can be read in advance of
their need and processing can occur at full speed.
In some applications,
in a header someplace. I would emphasize that this has narrow
utility. Programs with long lifetimes (for example, background
demons) and programs which use a significant portion of physical
memory should carefully
Allocate less stuff in the first place; this is kind of trite since
most programmers don't go to the trouble of calling
Buddy system allocators are subject to massive amounts of internal
fragmentation, and often their worst case is when you allocate
something whose size is near, or slightly larger than a power of two,
which is not exactly uncommon. They also tend to put things of like
size together in preference to putting things allocated in sequence
together; this affects locality of reference, sometimes in a good
way, sometimes not.
You may have
SunOS (and perhaps others) have
In the current computer era, the costs of main memory, cache memory,
and disk storage for VM are steadily decreasing. It is entirely
possible that rewriting the code won't improve the situation as much as
simply throwing money at the problem and buying more memory.
Some common sense is required, since for mass-market software this
isn't always practical. One side of the argument is that you can't
just tell everyone to spend $500 in upgrades just to run your $10
shareware program. On the other hand, increasing swap space by 16MB
will cost, averaged over time and population, about US$4 at current
(1997) rates.
In the best of all worlds, you would run a cache profiler to
find out exactly what parts of your program or data structures
are causing cache problems. Few development environments
include one. Often, the best that one can do is infer from an execution
profile that a cache problem might exist in a certain function,
especially if large arrays or many data structures are involved.
I/O (in Unix) usually puts your process to sleep for a time. The request
for I/O may finish fairly quickly but perhaps some other process is busy
using the CPU and your process has to wait. Because the length of the
wait is somewhat arbitrary and depends on what exactly the other processes
are doing, I/O bound programs can be tricky to optimize.
Buffered I/O is usually (but not always) faster than unbuffered.
Some gain can be wrought from using a larger than normal sized buffer;
try out
Consider using
Again, consider
You can set up a file descriptor to be non-blocking (see ioctl(2) or
fcntl(2) man pages) and arrange to have it send your process a signal
when the I/O is done; in the meantime your process can get something
else done, including perhaps sending off other I/O requests to other
devices. Significant parallelism may result at the cost of program
complexity.
Multithreading packages can aid in the construction of programs which
utilize asynchronous I/O.
If your program spews out a lot of data to the screen, it's going to
run slow on a 1200 baud line. Waiting for the screen to catch up
stops your program. This doesn't add to the CPU or disk time as
reported for accounting purposes, but it sure seems slow to the user.
Bitmap displays are subject to a similar problem: xterms with jump
scroll mode off can be quite slow at times.
A general solution is to provide a way for the user to squelch out
irrelevant data. Screen handling utilities like curses can speed
things up by reducing wasted screen updates.
These have some weird properties. You may find that sending short,
infrequent messages is extremely slow. This is because small messages
may just sit in a buffer for a while waiting for the rest of the
buffer to get filled up. There's some socket options you can set to
get around this for TCP; or switch to an non-streaming protocol
such as UDP. But the usual limitation is the network
itself and how busy it is.
For the most part there's no free lunch and sending more data simply
takes more time. However, a local network with a switched hub,
bandwidth should have a clear theoretical upper limit. If you aren't
getting anywhere near it, you might be able to blame the IP
implementation being used with your OS. Try switching to a different
"IP stack" and see if that helps a bit.
On Unix machines there are typically two socket libraries
available, BSD sockets, and TLI. One is usually implemented
in terms of the other and may suffer an extra buffer copy
or other overhead. Try them both and see which is faster.
This replacement for stdio written by Phong Vo and Dave Korn is
available through
Some of the things that may trip up novices:
1. Programmers tend to over-estimate the usefulness of the programs
they write. The approximate value of an optimization is:
even if the program will be run hundreds of times by thousands
of users, an extra day spent saving 40 milliseconds probably
isn't going to help.
2. Machines are not created equal. What's fast on one machine may be
slow on another. I don't just mean abstract machines like SPARC or
MC68000, but actual machines with serial numbers. Extra interface
cards, different disks, number of logged in users, extra memory,
background demons, and just about anything else can affect the speed
of various parts of a program and influence which part of the program
is a bottleneck and its speed as a whole.
A particular problem is that many programmers are power users and have
machines loaded with memory and a math co-processor and tons of disk space
while the users get bare-bones CPUs and run over a network. The programmer
gets a distorted view of the program's performance and may fail to
optimize the parts of the program which are taking up the user's time,
such as floating point or memory intensive routines.
3. As I've mentioned along the way, many of these optimizations may
already be performed by your compiler!
4. Don't get into the habit of writing code according to the above
rules of optimization. Only apply them after you have discovered
exactly which function is the problem. Some of the rules if applied
globally would make the program even slower. Nearly all make the
intent of the code less obvious to a human reader, which can get you
in trouble if you have to fix a bug or something later on. Be sure to
comment optimizations--the next programmer may simply assume it's ugly
code and rewrite it.
5. Spending a week optimizing a program can easily cost thousands of
dollars in programmer time. Sometimes, it's easier to just buy a
faster CPU or more memory or a faster disk and solve the problem that
way.
6. Novices often assume that writing lots of statements on a single
line and removing spaces and tabs will speed things up. While that
may be a valid technique for some interpreted languages, it doesn't
help at all in C.
LOCALITY OF REFERENCE
malloc
always allocates adjacent chunks of
memory on each successive call. You have to allocate one giant chunk
and dole it out yourself to be certain. But this can lead to other
problems.)
COLUMN-MAJOR ACCESSING
float array[20][100];
int i, j;
for (j = 0; j < 100; j++)
for (i = 0; i < 20; i++)
array[i][j] = 0.0;
DON'T COPY LARGE THINGS
SPLIT OR MERGE ARRAYS
REDUCE PADDING
Old code:
/* sizeof = 64 bytes */
struct foo {
float a;
double b;
float c;
double d;
short e;
long f;
short g;
long h;
char i;
int j;
char k;
int l;
};
New code:
/* sizeof = 48 bytes */
struct foo {
double b;
double d;
long f;
long h;
float a;
float c;
int j;
int l;
short e;
short g;
char i;
char k;
};
char
or short
variables is
to hold a flag or mode bit. You can combine several of these flags
into one byte using bit-fields at the cost of data portability. You
might instead use bitmasks and the &
operator to
extract the values; this is more portable but also more tedious.
INCREASE PADDING
malloc
instead of a static array. Some
malloc
s automatically allocate storage suitably
aligned for cache lines. (And some don't.)
memalign
) which
guarantees minimal alignment.
MARCH FORWARD
BEWARE THE POWER OF TWO
MEMORY LEAKS
malloc
and free
have some interesting behavior
at times, and are often implemented in completely different ways on
different machines. Some malloc
s have a way of "tuning" for
performance (mallopt
for example.)
malloc
takes very little time and free
takes a
lot, so if your program uses a bunch of memory but doesn't really
re-use any of it before it dies, the program might run faster if you
just do this
#define free(x) /* no-op */
free
everything as soon as it's no longer
needed.
BE STINGY
malloc
unless they
have a good reason, but there might be a few things that could go on
the stack instead. If you have some data structure which is mostly
"holes" like a hash table or sparse array, it might be to your
advantage to use a smaller table or list or something.
HINTING
mallopt
on your (Unixish) system. This can be used to
get malloc
to allocate small blocks fairly quickly and painlessly.
madvise()
and vadvise()
system
calls. These can be used to clue the OS in as to what sort of way
you're going to be accessing memory: randomly, sequentially, mark
some pages as no longer needed, etc.
FIX THE PROBLEM IN HARDWARE
CACHE PROFILERS
SEQUENTIAL ACCESS
setvbuf()
. If you aren't worried about portability you can
try using lower level routines like read()
and write()
with large
buffers and compare their performance to fread()
and fwrite()
. Using
read()
or write()
in a single-character-at-a-time mode is especially
slow on Unix machines because of the system call overhead.
mmap()
if you have it. This can save effort in several
ways. The data doesn't have to go through stdio which saves a buffer
copy. Depending on the sophistication of the paging hardware, the
data need not even be copied into user space; the program can just
access an existing copy. mmap()
also lends itself to read-ahead;
theoretically the entire file could be read into memory before you
even need it. Lastly, the file is can be paged directly off the
source disk and doesn't have to use up virtual memory.
RANDOM ACCESS
mmap()
if you have it. If there's
a trade off between
I/O bound and memory bound in your program, consider a lazy-free of
records: when memory gets tight, free unmodified records and write
out modified records (to a temporary file if need be) and read them in
later. Though if you take the disk space you'd use to write the
records out and just add it to the paging space instead you'll save
yourself a lot of hassle.
ASYNCHRONOUS I/O
TERMINALS
SOCKETS
SFIO
netlib@research.att.com
. By slightly
changing the calling sequences they have enabled several nifty
optimizations and claim a substantial improvement.
number of runs × number of users × time savings × user's salary
- time spent optimizing × programmer's salary
malloc
/free
part of the C library. Buddy system
allocators distribute blocks in memory depending in part on their
sizes and not merely by the order in which the were allocated.
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman Compilers: Principles, Techniques, and Tools (The "Dragon Book") ISBN 0-201-10088-6
The Dragon Book has quite a bit of detail about the inner workings of optimizers and what types of optimizations are easiest for the compiler to perform. By reading it you may gain some insight into what you can reasonably expect a modern compiler to do for you. There is an entry in the Jargon file for it as well:"Dragon Book n. The classic text "Compilers: Principles, Techniques and Tools", by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (Addison-Wesley 1986; ISBN 0-201-10088-6), so called because of the cover design featuring a dragon labeled `complexity of compiler design' and a knight bearing the lance `LALR parser generator' among his other trappings. This one is more specifically known as the `Red Dragon Book' (1986); an earlier edition, sans Sethi and titled "Principles Of Compiler Design" (Alfred V. Aho and Jeffrey D. Ullman; Addison-Wesley, 1977; ISBN 0-201-00022-9), was the `Green Dragon Book' (1977). (Also `New Dragon Book', `Old Dragon Book'.) The horsed knight and the Green Dragon were warily eying each other at a distance; now the knight is typing (wearing gauntlets!) at a terminal showing a video-game representation of the Red Dragon's head while the rest of the beast extends back in normal space. See also book titles."
American National Standard for Information Systems -- Programming Language -- C. ANSI X3.159-1989 aka ISO 9899-1990
Optimizers will sometimes mangle code that depends on side effects or order of evaluation so you should be aware of what's legal and what's not. By sticking with the letter of the law you should be able to use the highest level of optimization safely. The Rationale (available at several ftp sites) points out some of the optimization-related weirdnesses of C.Dowd, Kevin & Severance, Charles. High Performance Computing, 2nd Edition. O'Reilly & Associates. ISBN 1-56592-312-XI've referred to ANSI C throughout the document; substitute the term ISO C if you prefer.
This is a good introduction and reference for optimization of programs for RISC and other modern architectures. There is coverage of programming for parallelism, techniques for minimizing memory cache misses, and a lot of common sense guidelines sprinkled throughout about the proper way to approach the problem. Ordering information is available at the O'Reilly & Associates web site: http://www.ora.com/catalog/hpc2/
Kernighan, Brian & Ritchie, Dennis. The C Programming Language Second Edition. Prentice Hall, Englewood Cliffs NJ. ISBN 0-13-110370-9
A more widely available, more readable, and cheaper reference than the Standard. There isn't any specific advice about optimization though.
Knuth, Donald. The Art of Computer Programming Vol. 3 Searching and Sorting. Addison-Wesley, Reading MA. ISBN 0-201-03803-X
Knuth presents in excruciating detail the time and space complexity of nearly every imaginable search and sort algorithm. Since many programs are constrained either by a search or sort, or by some other algorithm which boils down to one of those, you may find this interesting reading. The examples are written in rather odd pseudo-assembly language called MIX which appears (to me) to have been designed specifically so that you can talk about exactly how many CPU cycles something takes.
Alvin R. Lebeck and David A. Wood "Cache Profiling and the SPEC Benchmarks: A Case Study,"
IEEE Computer, vol. 27, no. 10, Oct. 1994, pp. 15-26.
http://www.cs.duke.edu/~alvy/papers/cprof.html
Research paper describes a UNIX/X-windows cache-profiler system and their success in using it to optimize the SPEC family of benchmarks.
Mulder, Art. comp.windows.x: Getting more performance out of X.
http://www.ualberta.ca/~amulder/speedup-x-faq.txt
Many nifty GUI applications are constrained by limitations of the X client/server protocol or by the speed of the server itself. A profiling program run only on the application itself will not be able to pinpoint the true bottlenecks since they are in a different process. This describes some of the common problems and what to do about them, both from a programmer's viewpoint and for regular users.
NULLSTONE Corporation. NULLSTONE Optimization Categories,
http://www.nullstone.com/htmls/category.htm
A laundry list of the optimizations that compilers can and should do.
Pure Atria. Performance Engineering,
http://www.pureatria.com/asq/glperf.html
One corporation's viewpoint on optimization strategies.Sedgewick, R. Algorithms in C. Addison-Wesley, Reading MA, 1990. ISBN 0-201-51425-7.
Contains most of the classic algorithms, well presented.Standish, Thomas. Data Structure Techniques. Addison-Wesley, Reading MA. ISBN 0-201-07526-4.
This has a fairly complete description of everything you could ever want to know about hash tables, stacks, linked lists, arrays, strings, trees, queues and the algorithms for maintaining each. My copy is fairly old so it may be out of print or revised or something by now.Summit, Steve et al. Comp.lang.c Answers to Frequently Asked Questions.
Very little of this has to do with optimization specifically, but as with the ANSI standard, it is valuable to know what portions of the language are safe to exercise to their limits before you go a-hacking. Also available in book form.
Unix man pages. gprof(1) tcov(1) prof(1) time(1) csh(1) csh_builtins(1) monitor(3) cc(1) lorder(1) tsort(1) malloc(3) Included with most installations of the Unix[tm] operating system.
Specifically, check the man pages to find out what compile options you need for profiling. The SEE ALSO section of man pages will occasionally point you in interesting directions.
Abrash, Michael. Zen of Program Optimization. ISBN 1-883577-03-9
Lots of details about optimizing Intel x86 code.Bentley, Jon. Writing Efficient Programs. Prentice Hall. ISBN 0-13-970244-X.
Bentley, Jon. Programming Pearls. Addison-Wesley. 1986/1989 ISBN 0-201-10331-1
Jackson, Michael A. Principles of Program Design. Academic Press, London and New York, 1975.
"Jackson is the source of the quote:Kernighan, Brian W and Plauger, P.J. The Elements of Programming Style Second Edition. McGraw-Hill. 1978. ISBN 0-07-034207-5.There are two rules for when to optimize:
- Don't do it.
- (For experts only) Don't do it yet."
"The Elements of Programming Style has the most compelling discussions on efficiency (not to mention numerous other matters) that I've ever had the pleasure to read. Many of the programming blunders which the book dissects stem from misguided or heavyhanded optimization attempts; chapter 7 is devoted to Efficiency and Instrumentation, with some extremely convincing examples of (deprecated) code which uses poor algorithms and/or simpleminded source-level slower than straightforward code."Kozen, Dexter C. The Design and Analysis of Algorithms. Springer-Verlag, New York. 1992.
"The book treats a number of advanced algorithms and data structures, among the topics covered are: Splay Trees, Binomial Heaps, Fibonacci Heaps, Union-Find algorithms and Random Search Trees. It is not a book that I would recommend to someone who doesn't have a good understanding of algorithmics and mathematics, some of the topics covered are also very specialized, but it may be a worthwhile addition to some of the more basic books on algorithmics."Kruse, Robert L.; Bruce P. Leung; Clovis L. Tondo. Data Structures and Program Design in C. Prentice Hall. ISBN 0-13-725649-3
H. Massalin, "Superoptimizer: a look at the smallest program", Proc. ASPLOS II, 1987, pp. 122-127.
Spuler, David. C++ and C Efficiency. Prentice Hall, Sydney. ISBN 0-13-096595-2
Wirth, Niklaus. Algorithms + Data Structures = Programs.
I would especially like to thank these people for their help with grammar, catching typos, assistance with the structure of the document and most of all for taking the time to explain some of these optimization techniques to me.
If you have any suggestions or comments please let me know.
Copyright © 1997 Michael E. Lee
mikey -at- ontek.com