I tried to compile this, but the build failed because my CPU has no AVX2, so only some of the files compiled. I then ran cuckoo28:
perf stat -d -d -d ./cuckoo28
Looking for 42-cycle on cuckoo28("",0) with 50% edges, 7 trims, 1 threads
Using 16MB edge and 32MB node memory, 1-way siphash, and 4-byte counters
initial load 3200%
round 1 partition loads U0 2022% V0 947%
round 2 partition loads U0 560% V0 373%
round 3 partition loads U0 267% V0 201%
round 4 partition loads U0 157% V0 126%
round 5 partition loads U0 103% V0 86%
round 6 partition loads U0 73% V0 63%
round 7 partition loads U0 55% V0 48%
nonce 0: 7 trims completed final load 48%
4-cycle found at 0:67%
0 total solutions
Performance counter stats for './cuckoo28':
79535.505140 task-clock (msec) # 0.998 CPUs utilized
1,567 context-switches # 0.020 K/sec
2 cpu-migrations # 0.000 K/sec
5,366 page-faults # 0.067 K/sec
146,892,319,199 cycles # 1.847 GHz (6.66%)
104,304,503,009 instructions # 0.71 insn per cycle (13.35%)
4,551,218,923 branches # 57.222 M/sec (13.35%)
348,290,653 branch-misses # 7.65% of all branches (6.67%)
16,992,792,470 L1-dcache-loads # 213.650 M/sec (6.68%)
1,043,305,676 L1-dcache-load-misses # 6.14% of all L1-dcache hits (6.67%)
1,034,482,209 LLC-loads # 13.007 M/sec (6.67%)
698,460,299 LLC-load-misses # 0.95% of all LL-cache hits (6.67%)
145,872,608,950 L1-icache-loads # 1834.056 M/sec (6.67%)
11,774,481 L1-icache-load-misses (6.66%)
16,954,141,954 dTLB-loads # 213.164 M/sec (6.67%)
384,883,546 dTLB-load-misses # 2.27% of all dTLB cache hits (6.67%)
104,269,293,009 iTLB-loads # 1310.978 M/sec (13.34%)
17,157 iTLB-load-misses # 0.00% of all iTLB cache hits (13.34%)
9,380,542 L1-dcache-prefetches # 0.118 M/sec (6.67%)
<not supported> L1-dcache-prefetch-misses
79.674461632 seconds time elapsed
(That's on a Q8200 running at 1.86 GHz. Instructions per cycle look quite low at 0.71, which is normal-ish for RAM-bound tasks, but still.)
Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?):
http://imgur.com/a/zwEmP (data obtained by running perf top while ./cuckoo28 was running).
Given that this ran on a single thread, couldn't this be done with plain unlocked moves? Also, in the multithreaded case, would it be possible to gamble on faster unlocked moves, trading data accuracy for speed, to get a faster version that tolerates possible data corruption?
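To make the question concrete, here is a minimal sketch of the two kinds of update I mean. It is not the project's actual code: the bitmap layout and the names `set_bit_atomic`, `set_bit_plain` and `same_result_single_threaded` are made up for illustration. The first function does an atomic read-modify-write (lock-prefixed on x86, and compilers may emit a cmpxchg loop when the old value is needed); the second does a plain load, OR and store.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a shared bitmap; NOT the project's real data structure.

// Locked variant: atomic read-modify-write, returns whether the bit was already set.
inline bool set_bit_atomic(std::vector<std::atomic<std::uint32_t>>& words, std::size_t i) {
    assert(i / 32 < words.size());
    const std::uint32_t mask = std::uint32_t{1} << (i % 32);
    const std::uint32_t old = words[i / 32].fetch_or(mask, std::memory_order_relaxed);
    return (old & mask) != 0;
}

// Unlocked variant: plain load, OR, store. Safe only when a single thread touches the bitmap.
inline bool set_bit_plain(std::vector<std::uint32_t>& words, std::size_t i) {
    assert(i / 32 < words.size());
    const std::uint32_t mask = std::uint32_t{1} << (i % 32);
    const std::uint32_t old = words[i / 32];
    words[i / 32] = old | mask;
    return (old & mask) != 0;
}

// Applies the same pseudo-random sequence of updates through both variants on one
// thread and reports whether return values and final bitmaps agree.
inline bool same_result_single_threaded(std::size_t nbits, std::size_t nsets) {
    const std::size_t nwords = (nbits + 31) / 32;
    std::vector<std::atomic<std::uint32_t>> locked(nwords);
    std::vector<std::uint32_t> plain(nwords, 0);
    for (auto& w : locked) w.store(0, std::memory_order_relaxed);

    std::uint64_t x = 88172645463325252ULL;  // xorshift64 state, arbitrary nonzero seed
    for (std::size_t n = 0; n < nsets; ++n) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        const std::size_t i = static_cast<std::size_t>(x % nbits);
        if (set_bit_atomic(locked, i) != set_bit_plain(plain, i)) return false;
    }
    for (std::size_t w = 0; w < nwords; ++w) {
        if (locked[w].load(std::memory_order_relaxed) != plain[w]) return false;
    }
    return true;
}
```

With one thread, something like `same_result_single_threaded(1u << 16, 100000)` should come back true, since nothing can interleave with the plain update. With several threads the plain variant can lose updates when two threads write the same word at once, and in C++ that is formally a data race (undefined behaviour), so the "gamble" would be on how much lost data the algorithm can tolerate.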
Just some thoughts...