YMMV, but I found that -Os (optimize for size, which tends to make better use of the instruction cache) outperforms -O3 or -O4 by about 8%. Note that gcc treats -O4 the same as -O3.
update:
I only bothered testing on 4way, but I tested on three recent Intel processors: a Xeon W3530 @ 2.80GHz (8192 KB cache), a Xeon @ 3.00GHz (2048 KB cache), and a Xeon E5430 @ 2.66GHz (6144 KB cache). All gave approximately the same results.
Using version obtained from git with most recent commit 4a7f3f70b5628cb804ca4f46cf51651a1a42507f.
gcc version Ubuntu 4.4.3-4ubuntu5, CFLAGS="-O(s|3) -ftree-vectorize -march=native".
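
To spell out the -O(s|3) shorthand: it stands for two separate builds, one per optimization level, with the other flags unchanged. A minimal sketch of expanding it follows. The function name cflags_variants is made up for illustration, and the commented-out build loop assumes an autoconf-style ./configure, which may not match the project's actual build.

```shell
# Expand the -O(s|3) shorthand into the two CFLAGS values being compared.
# cflags_variants is a hypothetical helper name, not part of any project.
cflags_variants() {
  for opt in s 3; do
    printf '%s\n' "-O${opt} -ftree-vectorize -march=native"
  done
}

# Assumed build loop (only applies if the project uses ./configure):
# cflags_variants | while IFS= read -r flags; do
#   ./configure CFLAGS="$flags" && make clean && make
# done

cflags_variants
```

Each printed line is one complete CFLAGS value. Run the same benchmark against each resulting build to compare.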
jordan