All, I wanted to post this in a different forum, but alas 'newbies' have to post here.
Anyway, a friend with a bit of CPU power asked me to optimise the sse2_64 core. A couple of hours later I have a new core, sse2_64_atom. The development and most of the testing took place on an Intel Atom, so please forgive the name. However, it should give speed-ups on many CPUs, particularly Intel ones. As you all know, 4way is still fastest on AMD (although this core narrows the gap significantly).
An example on an Intel Atom D525 (dual core):
[2011-06-14 14:18:42] 2 miner threads started, using SHA256 'sse2_64' algorithm.
[2011-06-14 14:18:56] thread 0: 16777216 hashes, 1047.98 khash/sec
[2011-06-14 14:18:19] 2 miner threads started, using SHA256 'sse2_64_atom' algorithm.
[2011-06-14 14:18:31] thread 0: 16777216 hashes, 1234.20 khash/sec
That is about +18% and counting.
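If you want to work out the same figure for your own benchmarks, here is a rough sketch in Python that pulls the khash/sec values out of the log lines above and computes the relative gain. The helper names (rates_by_algo, speedup_percent) are just made up for illustration, they are not part of the miner, and the sketch assumes the log format shown in this post.

```python
import re

# The two runs quoted in this post, pasted verbatim.
LOG = """\
[2011-06-14 14:18:42] 2 miner threads started, using SHA256 'sse2_64' algorithm.
[2011-06-14 14:18:56] thread 0: 16777216 hashes, 1047.98 khash/sec
[2011-06-14 14:18:19] 2 miner threads started, using SHA256 'sse2_64_atom' algorithm.
[2011-06-14 14:18:31] thread 0: 16777216 hashes, 1234.20 khash/sec
"""


def rates_by_algo(log):
    """Return the mean khash/sec per algorithm name (hypothetical helper)."""
    rates = {}
    algo = None
    for line in log.splitlines():
        started = re.search(r"using SHA256 '([^']+)' algorithm", line)
        if started:
            # Every rate line that follows belongs to this algorithm.
            algo = started.group(1)
            continue
        rate = re.search(r"([\d.]+) khash/sec", line)
        if rate and algo is not None:
            rates.setdefault(algo, []).append(float(rate.group(1)))
    return {name: sum(vals) / len(vals) for name, vals in rates.items()}


def speedup_percent(baseline, candidate):
    """Relative gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


rates = rates_by_algo(LOG)
gain = speedup_percent(rates["sse2_64"], rates["sse2_64_atom"])
print(f"sse2_64_atom vs sse2_64: {gain:+.1f}%")
```

Paste your own log lines into LOG (one "threads started" line followed by its "khash/sec" lines per core) to compare any two cores the same way.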
You can grab the source here:
http://digit-labs.org/files/otherstuff/sha256_xmm_amd64_atom.asm.
Benchmarks are much appreciated, and comments/flames are also welcome.