Print Page - Cuckoo Cycle proof-of-work: $5000 in bounties

Title: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: tromp on November 09, 2016, 08:52:37 PM

Cuckoo Cycle is an instantly verifiable memory bound Proof-of-Work,
whose run time is dominated by memory latency.

I invite anyone to try claim one of the following bounties for performance doubling
of my C++/CUDA solvers:

CPU $2000
TMTO $2000
GPU $1000

or half the above bounties for a sqrt(2) performance improvement. Details at

https://github.com/tromp/cuckoo

Happy bounty hunting...

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: tromp on March 03, 2017, 09:31:04 PM

Since the bounties are (partially) backed by bitcoin in the Cuckoo Cycle Bounty Fund,
the recent boost in bitcoin value allows me to increase the bounties by 50%:

performance improvement 4x 2x sqrt(2)x
CPU $6000 $3000 $1500
TMTO $6000 $3000 $1500
GPU $3000 $1500 $ 750

Happy bounty hunting...

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: AlexGR on March 04, 2017, 03:09:55 PM

I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:

perf stat -d -d -d ./cuckoo28

Looking for 42-cycle on cuckoo28("",0) with 50% edges, 7 trims, 1 threads
Using 16MB edge and 32MB node memory, 1-way siphash, and 4-byte counters
initial load 3200%
round 1 partition loads U0 2022% V0 947%
round 2 partition loads U0 560% V0 373%
round 3 partition loads U0 267% V0 201%
round 4 partition loads U0 157% V0 126%
round 5 partition loads U0 103% V0 86%
round 6 partition loads U0 73% V0 63%
round 7 partition loads U0 55% V0 48%
nonce 0: 7 trims completed final load 48%
   4-cycle found at 0:67%
0 total solutions

Performance counter stats for './cuckoo28':

   79535.505140 task-clock (msec) # 0.998 CPUs utilized
   1,567 context-switches # 0.020 K/sec
   2 cpu-migrations # 0.000 K/sec
   5,366 page-faults # 0.067 K/sec
   146,892,319,199 cycles # 1.847 GHz (6.66%)
   104,304,503,009 instructions # 0.71 insn per cycle (13.35%)
   4,551,218,923 branches # 57.222 M/sec (13.35%)
   348,290,653 branch-misses # 7.65% of all branches (6.67%)
   16,992,792,470 L1-dcache-loads # 213.650 M/sec (6.68%)
   1,043,305,676 L1-dcache-load-misses # 6.14% of all L1-dcache hits (6.67%)
   1,034,482,209 LLC-loads # 13.007 M/sec (6.67%)
   698,460,299 LLC-load-misses # 0.95% of all LL-cache hits (6.67%)
   145,872,608,950 L1-icache-loads # 1834.056 M/sec (6.67%)
   11,774,481 L1-icache-load-misses (6.66%)
   16,954,141,954 dTLB-loads # 213.164 M/sec (6.67%)
   384,883,546 dTLB-load-misses # 2.27% of all dTLB cache hits (6.67%)
   104,269,293,009 iTLB-loads # 1310.978 M/sec (13.34%)
   17,157 iTLB-load-misses # 0.00% of all iTLB cache hits (13.34%)
   9,380,542 L1-dcache-prefetches # 0.118 M/sec (6.67%)
   <not supported> L1-dcache-prefetch-misses

   79.674461632 seconds time elapsed

(that's on a q8200 running @ 1.86 GHz). Ins/cycle are seemingly quite low - which is normal-ish for ram tasks, but still).

Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP
(data obtained by running perf top while ./cuckoo28 was running).

Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption?

Just some thoughts...

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: ArcCsch on March 05, 2017, 12:17:44 AM

Quote from: tromp on March 03, 2017, 09:31:04 PM

the recent boost in bitcoin value allows me to increase the bounties by 50%:

This seems to imply you are holding your bounties in bitcoin.
Since this is targeted at the bitcoin community, I assume you are also paying them in bitcoin.
Why, then, do you advertise the bounty amounts in dollars instead of in bitcoins?

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: tromp on March 05, 2017, 03:26:51 AM

Quote from: AlexGR on March 04, 2017, 03:09:55 PM

I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:

perf stat -d -d -d ./cuckoo28

Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP
(data obtained by running perf top while ./cuckoo28 was running).

Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption?

Indeed, for single threaded use (-t 1, which is the default) one should run the singlethreaded
./cuckoo28st instead. That one doesn't use atomic updates like the cmpxchg.

For multi-threaded use, increasing the number of threads also increases the likelihood of overlooking cycles
due to data-races. It's not clear though at how many threads the use of atomics becomes well-advised.

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: tromp on March 05, 2017, 03:30:36 AM

Quote from: ArcCsch on March 05, 2017, 12:17:44 AM

Quote from: tromp on March 03, 2017, 09:31:04 PM

the recent boost in bitcoin value allows me to increase the bounties by 50%:

Two reasons:
1) I want to limit my financial liability, and if people keep claiming bounties while bitcoin explodes in value, I might need to buy bitcoin at exorbitant prices.
2) I thought the dollar amounts sound more impressive:-)

But, on request, and while supplies last, I'd be happy to make a payout in bitcoin.

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: AlexGR on March 20, 2017, 09:06:14 AM

Quote from: tromp on March 05, 2017, 03:26:51 AM

Quote from: AlexGR on March 04, 2017, 03:09:55 PM

Aha, ok, yes much better now. I haven't got an avx2 cpu so it breaks down when I try to compile everything and I had just some of the files to check out. Btw, I see that there are some versions with very large ram windows, which obviously exceed cache levels. There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case.

Title: Re: Cuckoo Cycle proof-of-work: $5000 in bounties
Post by: tromp on March 20, 2017, 02:04:30 PM

Quote from: AlexGR on March 20, 2017, 09:06:14 AM

There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case.

I spend most time updating random bits, which I access at word (32 bit) granularity.
I tried using the corresponding intrinsic _mm_stream_si32, but it is a 45% slowdown...

Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: tromp on November 09, 2016, 08:52:37 PM