What follows is mostly mental noise produced while I was compiling ATLAS. Everyone produces plenty of mental noise all the time (right?), but this time I also happened to have some spare time to note some of it down, because, well, duh, 'twas compile time.
ATLAS is an implementation of BLAS and LAPACK, libraries useful for matrix-heavy and vector-heavy (mostly scientific) computations, and the ATLAS sources are a mix of Fortran and C. I've been building and benchmarking ATLAS on my laptop running Debian and on a server running Red Hat Enterprise Linux. Why would I want to do that, when both distributions ship binary atlas packages?
The default on both Debian and RHEL is a rather under-optimized binary that works correctly on all machines. RHEL additionally ships a few binary packages optimized for different vector instruction sets (3DNow!, SSE, SSE2, SSE3…), and the sysadmin is supposed to point the machine at an optimized library using the alternatives system. However, for best performance, both Debian and Red Hat recommend building ATLAS on the target machine.
To illustrate, let's look at an example. The table below shows execution times of DGEMM (double-precision general matrix multiplication), with the stock library and with the custom-built one. We're multiplying two matrices of 2000x2000 doubles – see example source code.
"Before" is the execution time when using the stock library. "After" is when using the custom-built version. (A) is a Thinkpad running Debian (4-core Intel i5 CPU running at 2.50GHz, 6GiB RAM, GCC 4.7.1, "noisy" in terms of CPU load), and (B) is big honking RHEL server (lightly loaded 32-core Intel Xeon running at 2.13GHz, 64GiB RAM, GCC 4.4.6 etc), which may characterize the "production" machines.
A 791% improvement in performance on the laptop – nearly a 9x speedup – is pretty awesome. The 108% improvement on the server (only about 2x) is pretty weak by comparison. What gives?
I believe recent GCC improvements must play a role. (Note that GCC is the GNU Compiler Collection, which includes gcc, the C compiler, as well as compilers for Fortran, C++, etc.) Wading through GCC's changelog and NEWS hasn't given me much of an idea about the specific improvements. Building a newer version of GCC on the server doesn't sound like fun, but it may be worth trying.
When speaking of fast code, one can't omit the Intel compiler. In theory, ICC should produce better-performing code on Intel CPUs. ICC is proprietary, but available on the department's machines. The ATLAS people do not recommend using ICC, though, because of ICC's murky status regarding IEEE floating-point compliance. (I tried building ATLAS using ICC, but didn't find any marked improvement; the build system is wired such that ICC compiles only the interface routines.)
Compile time on the server is pretty long: 7 hours and 22 minutes. (It took less than that on the laptop, and while that too is pretty lengthy, I take it as a sign that GCC is becoming more awesome?) The ATLAS 3.10.0 sources are 48MB; the build tree is 227MB. ATLAS runs a bunch of timing tests to figure out how best to tune for the machine, and a good chunk of the build tree is cached results of these tests. The build fails when using "make -j" to speed things up a bit – is this because the dependencies are not correctly specified, or does it have something to do with the build-time tests? I don't know.
And thus goes the mental noise.