I am following up on

the discussion that tromp and I had upthread about whether his Cuckcoo proof-of-work is ASIC resistant, which began as

a post by myself about Ethereum's Dagger proof-of-work.

**For all the following reasons, I have concluded that ASIC resistant proof-of-work algorithms can't exist.** This applies to Cryptonite's (Monero's) proof-of-work hash and Zcash's Equihash. The CPU simply isn't designed to do any one algorithm most efficiently. It makes tradeoffs in order to be adept at generalized computation. The ASIC will always be at least two orders-of-magnitude more power efficient.

In that prior discussion with tromp, I had suggested some strategies of using special hardware threads on the ASIC to coalesce memory accesses in the same memory row and/or queuing hash computations that didn't coalesce to attain greater power efficiency on the ASIC, potentially up to two or three orders of magnitude.

I have also realized that it might be possible to design DRAM circuits with much smaller row banks, which also has the potential for two orders of magnitude power efficiency improvement for this specific Cuckcoo algorithm. General computing DRAM must have large row buffer sizes so as to exploit efficiencies in locality of memory, so it wouldn't be possible to put these specialized DRAMs on general purpose computers.

As these DRAM power costs are shrunk, the power cost of the hash computations becomes a greater proportion, allowing those to be 3 to 4 orders of magnitude more power efficient on the ASIC.

Even if the Cuckcoo algorithm is shrunk to operate on L1 cash and even after I converted it to use up to 32 (16-bit) buckets per slot and track all the cycles and find multiple copies of cycles, so that I could force the consumption of the entire SRAM cache line per random access, still the ASIC will end up probably at least two orders-of-magnitude more power efficient, because for example N-way caches (required for general computation) are not as power efficient as direct-mapped (which is all that is required in this case).

This is not to mention that the ASIC mining farm has another 1/2 to 1 order-of-magnitude advantage on electricity cost compared to the CPU/GPU marginal miners.

Additionally:

1. Accessing a memory set that is larger than the cache(s) incurs the power consumption cost of loading the cache(s) and resetting the pipeline on cache misses[1]. N-way set associative L1 caches typically read in parallel the SRAM rows for all N ways, before the cache miss is detected[2]. On systems with inclusive L2 and L3 caches such as Intel, these are also always loaded. Since each SRAM access incurs 10X power consumption of DRAM (albeit 10X faster) and albeit that the DRAM row size is typically 128X greater[3], the baseline cache power consumption can become significant relative to that of the DRAM; and especially if our algorithm employs locality to amortize DRAM row loading over many accesses (in excess of the locality of the SRAM row size). An ASIC/FPGA could be designed to eliminate this baseline cache power consumption.

2. Due to the aforementioned N-way power consumption overhead, Android's 2-way L1 cache is more power efficient than Haswell's 8-way[4], and both have the same latency.

[1]

https://lwn.net/Articles/252125/[2]

https://en.wikipedia.org/wiki/CPU_cache#Implementation http://personal.denison.edu/~bressoud/cs281-s10/Supplements/caching2.pdf#page=4

[3]

http://www.futurechips.org/chip-design-for-all/what-every-programmer-should-know-about-the-memory-system.html[4]

http://www.7-cpu.com/cpu/Cortex-A15.html https://en.wikipedia.org/wiki/ARM_Cortex-A57#Overview

http://www.7-cpu.com/cpu/Haswell.html

The following is sloppy code I just hacked quickly to do the test I needed.

// Cuckoo Cycle, an attempt for an ASIC-resistant proof-of-work

// a derivative of the original by John Tromp¹

#include "cuckoo.h"

#include <stdbool.h>

#include <time.h>

// algorithm parameters

#define MAXBUFLEN PROOFSIZE*BUCKETS

// used to simplify nonce recovery

int cuckoo[1+SIZE][BUCKETS]; // global; conveniently initialized to zero

uint8_t cycle[1+SIZE];

int nonce, reads=0, len, cycled, count=0;

u16 node, path[PROOFSIZE], buckets[MAXBUFLEN];

bool recordCycle(const int pi) {

// cycled = true;

len = pi;

printf("% 4d-cycle found at %d%%\n", len, (int)(nonce*100L/EASYNESS));

return pi == PROOFSIZE && ++count > 256;

}

// Returns whether all cycles found.

bool walkOdd(const int pi, int bi);

bool walkEven(const int pi, int bi) {

// End of path?

if ((reads++, cuckoo[node][1]) == 0) // iff 1 occupied bucket, it reverses the path pointing back to the node that pointed to this node

return false;

// Cache all the node's buckets so DRAM row buffer loads are not repeated

int i = 0;

do {

buckets[bi+i] = cuckoo[node][i]; // compact into buffer instead of allocating the BUCKETS on stack

} while (++i < BUCKETS && cuckoo[node][i] != 0);

// Walk the path for each bucket

int j = bi;

bi += i;

for (; j < bi; j++) {

node = buckets[j];

// Repeating a cycle or reversing the path?

int k = pi - 2;

while (node != path[k] && --k >= 0);

if (k > 0)

continue;

// Cycle found? (only in WalkEven() bcz bipartite graphs only contain even length cycles, bcz there are no edges U <--> U nor V <--> V)

if (k == 0) {

// All cycles found?

if (recordCycle(pi))

return true;

}

path[pi] = node;

if (walkOdd(pi+1, bi))

return true;

}

return false;

}

bool walkOdd(const int pi, int bi) {

// Path would exceed maximum cycle length?

if (pi > PROOFSIZE)

return false;

// End of path?

if ((reads++, cuckoo[node][1]) == 0) // iff 1 occupied bucket, it reverses the path pointing back to the node that pointed to this node

return false;

// Cache all the node's buckets so DRAM row buffer loads are not repeated

int i = 0;

do {

buckets[bi+i] = cuckoo[node][i]; // compact into buffer instead of allocating the BUCKETS on stack

} while (++i < BUCKETS && cuckoo[node][i] != 0);

// Walk the path for each bucket

int j = bi;

bi += i;

for (; j < bi; j++) {

node = buckets[j];

// Repeating a cycle or reversing the path?

int k = pi - 2;

while (node != path[k] && --k >= 0);

if (k > 0)

continue;

path[pi] = node;

if (walkEven(pi+1, bi))

return true;

}

return false;

}

bool walkThird(const int pi, int bi) {

// End of path?

if ((reads++, cuckoo[node][1]) == 0) // iff 1 occupied bucket, it reverses the path pointing back to the node that pointed to this node

return false;

// Cache all the node's buckets so DRAM row buffer loads are not repeated

int i = 0;

do {

buckets[bi+i] = cuckoo[node][i]; // compact into buffer instead of allocating the BUCKETS on stack

} while (++i < BUCKETS && cuckoo[node][i] != 0);

// Walk the path for each bucket

int j = bi;

bi += i;

for (; j < bi; j++) {

node = buckets[j];

// Repeating a cycle or reversing the path?

int k = pi - 2;

while (node != path[k] && --k >= 0);

if (k > 0)

continue;

path[pi] = node;

if (walkEven(pi+1, bi))

return true;

}

return false;

}

bool walkFirstTwo() {

// End of path?

if (cuckoo[node][1] == 0) // iff 1 occupied bucket, it reverses the path pointing back to the node that pointed to this node

return false;

// Cache all the node's buckets so DRAM row buffer loads are not repeated

int i = 0;

do {

buckets[i] = cuckoo[node][i]; // compact into buffer instead of allocating the BUCKETS on stack

} while (++i < BUCKETS && cuckoo[node][i] != 0);

// Walk the path for each bucket

for (int j = 0; j < i; j++) {

node = buckets[j];

// Reversing the path?

if (node == path[0])

continue;

path[2] = node;

if (walkThird(2+1, i))

return true;

}

return false;

}

int main(int argc, char **argv) {

clock_t start = clock(), diff;

assert(SIZE <= (1 << 14)); // 32KB L1 cache (of u16 elements, note u16 allows up 64KB)

assert(PROOFSIZE > 2);// c.f. `walk(2, 0)`

assert(BUCKETS > 1); // c.f. `(reads++, cuckoo[node][1]) == 0`

char *header = argc >= 2 ? argv[1] : "";

setheader(header);

printf("Looking for %d-cycle on cuckoo%d%d(\"%s\") with %d edges and %d buckets\n",

PROOFSIZE, SIZEMULT, SIZESHIFT, header, EASYNESS, BUCKETS);

int j, u, v, hashes=0, writes=0, maxes=0, max=0;

for (nonce = 0; nonce < EASYNESS; nonce++) {

sipedge(nonce, &u, &v); hashes++;

// Edge already exists?

reads++;

int i = 0;

while (cuckoo[u][i] != v && cuckoo[u][i] != 0 && ++i < BUCKETS);

if (i < BUCKETS && cuckoo[u][i] == v)

continue; // ignore duplicate edges

// Not enough buckets?

if (i == BUCKETS) {

if (i == max) maxes++;

else {

max = i;

maxes = 1;

}

continue;

}

else if (i <= max) {

if (i == max) maxes++;

}

else {

max = i;

maxes = 1;

}

reads++;

for (j = 0; cuckoo[v][j] != 0 && ++j < BUCKETS;) {}

if (j == BUCKETS) {

if (j == max) maxes++;

else {

max = j;

maxes = 1;

}

continue;

}

else if (j <= max) {

if (j == max) maxes++;

}

else {

max = j;

maxes = 1;

}

// Add new edge

cuckoo[u][i] = v; writes++;

cuckoo[v][j] = u; writes++;

cycled = false;

// Search for cycles?

if (i > 0 && j > 0) {

path[0] = u;

path[1] = v;

node = v;

// Found?

if (walkFirstTwo()) {

int pi = len;

// Mark the cycle edges

while (--pi >= 0)

cycle[path[pi]] = true;

// Enumerate the nonces for the marked cycle edge

for (; len; nonce--) {

sipedge(nonce, &u, &v); hashes++;

if (cycle[u] && cycle[v])

printf("%2d %08x (%d,%d)\n", --len, nonce, u, v);

}

break;

}

if (cycled) {

cuckoo[u][i] = 0; writes++;

cuckoo[v][j] = 0; writes++;

}

}

}

printf("Hashes: %d\n Reads: %d\nWrites: %d\n Maxes: %d\n Max: %d\n", hashes, reads, writes, maxes, max);

diff = clock() - start;

int msec = diff * 1000 / CLOCKS_PER_SEC;

printf("Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000);

return 0;

}

// ¹https://bitcointalk.org/index.php?topic=1361602.msg15111595#msg15111595

// Cuckoo Cycle, a memory-hard proof-of-work

// Copyright (c) 2013-2014 John Tromp

#include <stdio.h>

#include <stdint.h>

#include <string.h>

#include <assert.h>

void SHA256(const unsigned char *header, size_t len, unsigned char hash[32]);

// proof-of-work parameters

#ifndef SIZEMULT

#define SIZEMULT 1

#endif

#ifndef SIZESHIFT

#define SIZESHIFT 14

#endif

#ifndef EASYNESS

#define EASYNESS (SIZE*16) // controls probability of finding a cycle before completion

#endif

#ifndef PROOFSIZE

#define PROOFSIZE 6

#endif

#ifndef BUCKETS

#define BUCKETS 10

#endif

#define SIZE (SIZEMULT*(1<<SIZESHIFT))

// relatively prime partition sizes

#define PARTU (SIZE/2+1)

#define PARTV (SIZE/2-1)

// Otherwise if (d=gcd(U,V)) > 1, frequencies multiples of d are mirrored

// in both partions, and SIZE/2 effectively shrinks to SIZE/(2*d); because given

// hash(k)=d*hash'(k), U=d*U', and V=d*V', thus d*hash'(k) mod d*U' =

// hash'(k) mod U' and d*hash'(k) mod d*V' = hash'(k) mod V':

// http://pub.gajendra.net/2012/09/notes_on_collisions_in_a_common_string_hashing_function

// http://stackoverflow.com/questions/25830215/how-does-double-hashing-work#comment40410794_25830215

typedef uint64_t u64;

#define ROTL(x,b) (u64)( ((x) << (b)) | ( (x) >> (64 - (b))) )

#define SIPROUND \

do { \

v0 += v1; v1=ROTL(v1,13); v1 ^= v0; v0=ROTL(v0,32); \

v2 += v3; v3=ROTL(v3,16); v3 ^= v2; \

v0 += v3; v3=ROTL(v3,21); v3 ^= v0; \

v2 += v1; v1=ROTL(v1,17); v1 ^= v2; v2=ROTL(v2,32); \

} while(0)

// SipHash-2-4 specialized to precomputed key and 4 byte nonces

u64 siphash24( int nonce, u64 v0, u64 v1, u64 v2, u64 v3) {

u64 b = ( ( u64 )4 ) << 56 | nonce;

v3 ^= b;

SIPROUND; SIPROUND;

v0 ^= b;

v2 ^= 0xff;

SIPROUND; SIPROUND; SIPROUND; SIPROUND;

return v0 ^ v1 ^ v2 ^ v3;

}

u64 v0 = 0x736f6d6570736575ULL, v1 = 0x646f72616e646f6dULL,

v2 = 0x6c7967656e657261ULL, v3 = 0x7465646279746573ULL;

#define U8TO64_LE(p) \

(((u64)((p)[0]) ) | ((u64)((p)[1]) << 8) | \

((u64)((p)[2]) << 16) | ((u64)((p)[3]) << 24) | \

((u64)((p)[4]) << 32) | ((u64)((p)[5]) << 40) | \

((u64)((p)[6]) << 48) | ((u64)((p)[7]) << 56))

// derive siphash key from header

void setheader(const char *header) {

unsigned char hdrkey[32];

SHA256((unsigned char *)header, strlen(header), hdrkey);

u64 k0 = U8TO64_LE( hdrkey ); u64 k1 = U8TO64_LE( hdrkey + 8 );

v3 ^= k1; v2 ^= k0; v1 ^= k1; v0 ^= k0;

}

// generate edge in cuckoo graph

typedef uint16_t u16;

void sipedge(int nonce, u16 *pu, u16 *pv) {

u64 sip = siphash24(nonce, v0, v1, v2, v3);

*pu = 1 + (u16)(sip % PARTU); // "1 +" bcz 0 is the "no edge" ("empty cell or ⊥"¹⁰) value

*pv = 1 + PARTU + (u16)(sip % PARTV);

}