All addresses so far, to make it easier to pick the next three recipients:
19tq9NYFsSCMRUkc12v363tHgAeUokoVam noagendamarket
1Mz6PNnCJ1cwVK3qaH66QtCko5oNaHN4a nelisky
1Edvb7z8RARx7F3Y5oi9ZX8tQadq5RfjSD tcatm
1QCNUyy3ViFnMEVTiuMvWgdnJnfYepMibg Macho
1PnuAFsYqmUbMfgqMUh5cNJvT72RtGTBHk bitcoindonor
16k9if6hVQUdA5XmYHqytrSYubdCDX5iXa BiddingPond.com
1AxsH46YUyaxX1fvhixhHqhyjbC15wFfJN Smartzkid
1D3UqLGyEZvFJGnHvPm1hRVC94cpnuEnQr mizerydearia
|
|
|
I'd like to see a market like mtgox for EUR (PayPal and wire transfer). If someone wants to start one, I'd even offer to help write it.
|
|
|
So, what CPUs support this? Is this only the newest AMD ones? And how many systems are we excluding because of this?
Phenoms, i5 and i7 from what I know. Those are the only CPUs that have a 128-bit SSE2 instruction decoder and benefit at all; every older CPU will be slower. Don't think about it as "only works on AMD's K10" but rather as "tweak the compiler to produce the exact assembly code we want and still be flexible enough to support other vector engines in the future".
|
|
|
iirc, it is possible to specify -march on a per-function basis using some gcc __attribute__. That way, only the function in question would be optimized, and if the user doesn't specify -4way, everything else should be ok.
We only compile one source file with the 4way code (sha256.cpp) using -march=amdfam10, not the whole client.
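The per-function idea from the quote can be sketched with the `target` function attribute. This is only an illustration, assuming GCC or Clang on x86; `add4` is a made-up helper, not a function from sha256.cpp, and `sse2` stands in for whatever instruction set one would actually want to select. Whether this produces the same code as compiling the file with -march=amdfam10 would have to be checked.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (not from sha256.cpp): adds four 32-bit lanes at once.
// The target attribute asks the compiler to enable SSE2 for this function
// only, independent of the flags used for the rest of the file.
__attribute__((target("sse2")))
void add4(const uint32_t* a, const uint32_t* b, uint32_t* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // Lane-wise 32-bit addition; each lane wraps around on overflow.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```

The caller would still need a runtime check (or the -4way switch) before calling such a function on a CPU that may lack the instructions.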
|
|
|
... -march=XXXX means the compiler expects the binary will only be run on amdfam10.
That's exactly what we want. But I agree, it's a dirty hack to use -march=amdfam10. In this case it'll produce the most compact and efficient SSE2 code from the source. A cleaner alternative would be inline assembler.
|
|
|
I just reviewed the source code, as I had a few ideas to optimize it further, and I noticed that 4way is partly broken. From main.cpp:

    for (int j = 0; j < NPAR; j++)
    {
        if (thash[7][j] == 0)
        {
            for (int i = 0; i < sizeof(hash)/4; i++)
                ((unsigned int*)&hash)[i] = thash[i][j];
            pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
        }
    }
The code will only process one hash (the last one with thash[7] == 0) out of 32 hashes, even when more than one hash might be a correct one. Something like this should fix it, but it won't be safe at higher difficulties. Also, I'm not sure whether the byte order should be reversed or not. Could someone review this?

    unsigned int min_hash = ~0U;
    for (int j = 0; j < NPAR; j++)
    {
        if (thash[7][j] == 0)
        {
            if (thash[6][j] < min_hash)
            {
                min_hash = thash[6][j];
                for (int i = 0; i < sizeof(hash)/4; i++)
                    ((unsigned int*)&hash)[i] = thash[i][j];
                pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
            }
        }
    }
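For anyone reviewing, here is a standalone sketch of the selection step that compares all words instead of only thash[6]. It is not the client's code: `BestLane` is a hypothetical name, NPAR is set to 32 to match the 32 hashes mentioned above, and it assumes word 7 is the most significant word of each hash (which is what the thash[7] == 0 check suggests). It leaves the byte-order question open.

```cpp
#include <cstdint>

const int NPAR = 32;  // hashes computed in parallel (assumed, see lead-in)

// Returns the lane whose hash is numerically smallest among the lanes with
// thash[7][j] == 0, or -1 if no lane qualifies.
// Assumption: word 7 is the most significant word, word 0 the least.
int BestLane(const uint32_t thash[8][NPAR])
{
    int best = -1;
    for (int j = 0; j < NPAR; j++)
    {
        if (thash[7][j] != 0)
            continue;  // top word must be zero to be a candidate
        if (best < 0)
        {
            best = j;  // first candidate found
            continue;
        }
        // Compare the remaining words, most significant first.
        for (int i = 6; i >= 0; i--)
        {
            if (thash[i][j] != thash[i][best])
            {
                if (thash[i][j] < thash[i][best])
                    best = j;
                break;
            }
        }
    }
    return best;
}
```

With the lane index in hand, the copy into hash and the nNonce assignment from the loop above would be done once, for that lane only.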
|
|
|
Nice website! I'll add it to my blog bitcoinblogger.com
It's jgarzik's, not mine. I only did the design.
|
|
|
I've decided to offer a web design service in exchange for bitcoins. I'll design or redesign a website with HTML, CSS, images and JavaScript. I'll produce clean and tidy HTML and CSS that works in most modern browsers (Firefox, Chrome, Safari, Opera, ...) for almost any template engine. I can also do a little web programming to fit a design to existing backends. If you're interested, you can email me at tcatm@gawab.com.
References:
http://bitcoinwatch.com/
http://bitcoincharts.com/
http://smsz.net/
(edit: added more references)
|
|
|
Sounds interesting 1Edvb7z8RARx7F3Y5oi9ZX8tQadq5RfjSD
|
|
|
So there are about 2000 people generating coins (assuming an average of 1.5 Mhash/s each, plus one 1 Ghash/s cluster).
|
|
|
How does the Difficulty number (511) relate to the hash target?
Difficulty = 0xFFFF0000000000000000000000000000000000000000000000000000 / HashTarget
For easier calculation you can divide numerator and denominator by 2^193 (i.e. shift both right by 193 bits).
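A rough sketch of that calculation, to make the shift concrete. The `Target256` struct and its limb layout are made up for this example and are not the client's types; the code also assumes the numerator above equals 0xFFFF * 2^208, so that shifting it right by 193 bits leaves 0xFFFF * 2^15 = 0x7FFF8000.

```cpp
#include <cstdint>

// A 256-bit target as four 64-bit limbs, limb[0] least significant.
// This layout is an assumption chosen for the example.
struct Target256
{
    uint64_t limb[4];
};

// Approximate difficulty using the shift-by-193 trick.
// Bits 192..255 live in limb[3], so (target >> 193) == (limb[3] >> 1).
double Difficulty(const Target256& target)
{
    // Numerator after the shift, assuming it was 0xFFFF * 2^208 before.
    const uint64_t numerator = 0x7FFF8000ULL;
    const uint64_t denominator = target.limb[3] >> 1;
    if (denominator == 0)
        return 0.0;  // marker: target too small for this approximation
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}
```

The shift throws away the low 193 bits of the target, so the result is an approximation; it is exact for the maximum target (difficulty 1).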
|
|
|
@satoshi: Oops, I meant -march=amdfam10. Sorry.
@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.
Concerning Hyperthreading: It seems to give a small performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There are only a few plain x86 instructions for gluing the function into the ABI. They take less than about 2% of the total CPU time (measured with gprof).
|
|
|
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
Good. Will this also work on Windows OS?
Didn't try it, but CFLAGS are not OS dependent at all, so I guess it'll work.
|
|
|
I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it'll improve performance by ~9%.
|
|
|
Did anyone verify it to produce correct results on 32 bit hosts?
|
|
|
-4way: 12518 khash/s without: 6550 khash/s
It's a little bit slower than my patch (~14000 khash/s).
edit: I ran the binary on an older AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ with the same effect we see on older Intel CPUs: -4way: 1120 khash/s without: 2012 khash/s
|
|
|
If I didn't know better, I would say the key is the CPU cache size. It seems all the CPUs that run slower have 2 MB or less of onboard cache, whereas the Core i5 starts with at least 3 MB of onboard CPU cache.
That's unlikely. The loop accesses 432 bytes of data. That should fit in most caches.
|
|
|
MinGW on Windows has trouble compiling it:
g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp
sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report, with preprocessed source if appropriate.
See <URL:http://www.mingw.org/bugs.shtml> for instructions.
make: *** [obj/sha256.o] Error 1
Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it with -O0?
|
|
|
1. Do we know why it doesn't work on 32bit? Is it because it's using 128 bits and if so, would it help if we dropped it to 64?
No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don't have an SSE2-capable 32bit system. The additional registers in 64bit mode are also useful. I don't know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old. Btw, did anyone with an Intel CPU compare performance with Hyperthreading enabled and disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy, and I can imagine Hyperthreading might decrease performance.
|
|
|
1. Does not work on 32-bit (though that's not a problem with the algorithm).
2. Patch is against older SVN. There's a git repo at http://github.com/tcatm/bitcoin-cruncher
3. Compiles on every 64bit Linux.
It's not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I'm planning a pluggable bitcoinminer someday. But at the current difficulty it's easier to work for bitcoins than to find faster ways of mining.
|
|
|
|