Title: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: tromp on November 09, 2016, 08:52:37 PM Cuckoo Cycle is an instantly verifiable memory bound Proof-of-Work,
whose run time is dominated by memory latency. I invite anyone to try claim one of the following bounties for performance doubling of my C++/CUDA solvers: CPU $2000 TMTO $2000 GPU $1000 or half the above bounties for a sqrt(2) performance improvement. Details at https://github.com/tromp/cuckoo Happy bounty hunting... Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: tromp on March 03, 2017, 09:31:04 PM Since the bounties are (partially) backed by bitcoin in the Cuckoo Cycle Bounty Fund,
the recent boost in bitcoin value allows me to increase the bounties by 50%: performance improvement 4x 2x sqrt(2)x CPU $6000 $3000 $1500 TMTO $6000 $3000 $1500 GPU $3000 $1500 $ 750 Happy bounty hunting... Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: AlexGR on March 04, 2017, 03:09:55 PM I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:
perf stat -d -d -d ./cuckoo28 Looking for 42-cycle on cuckoo28("",0) with 50% edges, 7 trims, 1 threads Using 16MB edge and 32MB node memory, 1-way siphash, and 4-byte counters initial load 3200% round 1 partition loads U0 2022% V0 947% round 2 partition loads U0 560% V0 373% round 3 partition loads U0 267% V0 201% round 4 partition loads U0 157% V0 126% round 5 partition loads U0 103% V0 86% round 6 partition loads U0 73% V0 63% round 7 partition loads U0 55% V0 48% nonce 0: 7 trims completed final load 48% 4-cycle found at 0:67% 0 total solutions Performance counter stats for './cuckoo28': 79535.505140 task-clock (msec) # 0.998 CPUs utilized 1,567 context-switches # 0.020 K/sec 2 cpu-migrations # 0.000 K/sec 5,366 page-faults # 0.067 K/sec 146,892,319,199 cycles # 1.847 GHz (6.66%) 104,304,503,009 instructions # 0.71 insn per cycle (13.35%) 4,551,218,923 branches # 57.222 M/sec (13.35%) 348,290,653 branch-misses # 7.65% of all branches (6.67%) 16,992,792,470 L1-dcache-loads # 213.650 M/sec (6.68%) 1,043,305,676 L1-dcache-load-misses # 6.14% of all L1-dcache hits (6.67%) 1,034,482,209 LLC-loads # 13.007 M/sec (6.67%) 698,460,299 LLC-load-misses # 0.95% of all LL-cache hits (6.67%) 145,872,608,950 L1-icache-loads # 1834.056 M/sec (6.67%) 11,774,481 L1-icache-load-misses (6.66%) 16,954,141,954 dTLB-loads # 213.164 M/sec (6.67%) 384,883,546 dTLB-load-misses # 2.27% of all dTLB cache hits (6.67%) 104,269,293,009 iTLB-loads # 1310.978 M/sec (13.34%) 17,157 iTLB-load-misses # 0.00% of all iTLB cache hits (13.34%) 9,380,542 L1-dcache-prefetches # 0.118 M/sec (6.67%) <not supported> L1-dcache-prefetch-misses 79.674461632 seconds time elapsed (that's on a q8200 running @ 1.86 GHz). Ins/cycle are seemingly quite low - which is normal-ish for ram tasks, but still). Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP (data obtained by running perf top while ./cuckoo28 was running). Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption? Just some thoughts... Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: ArcCsch on March 05, 2017, 12:17:44 AM the recent boost in bitcoin value allows me to increase the bounties by 50%: This seems to imply you are holding your bounties in bitcoin.Since this is targeted at the bitcoin community, I assume you are also paying them in bitcoin. Why, then, do you advertise the bounty amounts in dollars instead of in bitcoins? Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: tromp on March 05, 2017, 03:26:51 AM I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28: perf stat -d -d -d ./cuckoo28 Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP (data obtained by running perf top while ./cuckoo28 was running). Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption? Indeed, for single threaded use (-t 1, which is the default) one should run the singlethreaded ./cuckoo28st instead. That one doesn't use atomic updates like the cmpxchg. For multi-threaded use, increasing the number of threads also increases the likelihood of overlooking cycles due to data-races. It's not clear though at how many threads the use of atomics becomes well-advised. Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: tromp on March 05, 2017, 03:30:36 AM the recent boost in bitcoin value allows me to increase the bounties by 50%: This seems to imply you are holding your bounties in bitcoin.Since this is targeted at the bitcoin community, I assume you are also paying them in bitcoin. Why, then, do you advertise the bounty amounts in dollars instead of in bitcoins? Two reasons: 1) I want to limit my financial liability, and if people keep claiming bounties while bitcoin explodes in value, I might need to buy bitcoin at exorbitant prices. 2) I thought the dollar amounts sound more impressive:-) But, on request, and while supplies last, I'd be happy to make a payout in bitcoin. Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: AlexGR on March 20, 2017, 09:06:14 AM I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28: perf stat -d -d -d ./cuckoo28 Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP (data obtained by running perf top while ./cuckoo28 was running). Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption? Indeed, for single threaded use (-t 1, which is the default) one should run the singlethreaded ./cuckoo28st instead. That one doesn't use atomic updates like the cmpxchg. For multi-threaded use, increasing the number of threads also increases the likelihood of overlooking cycles due to data-races. It's not clear though at how many threads the use of atomics becomes well-advised. Aha, ok, yes much better now. I haven't got an avx2 cpu so it breaks down when I try to compile everything and I had just some of the files to check out. Btw, I see that there are some versions with very large ram windows, which obviously exceed cache levels. There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case. Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties Post by: tromp on March 20, 2017, 02:04:30 PM There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case. I spend most time updating random bits, which I access at word (32 bit) granularity. I tried using the corresponding intrinsic _mm_stream_si32, but it is a 45% slowdown... |