I am just arguing: why not be safe rather than sorry when so much is at stake? Why would you recommend adding unnecessary risk, even if you think you are omniscient and there is no risk?

I only need good statistical properties of the hash function, not cryptographic security.

Pseudorandomness ("good statistical properties") is tied to cryptographic security:

BLAKE2 aims to provide the highest security level, be it in terms of classical notions as (second) preimage or collision resistance, or of theoretical notions as pseudorandomness (a.k.a. indistinguishability) or indifferentiability.

Relevant terms are 'indistinguishability' and 'indifferentiability'.

SipHash is sufficient for the use case it was designed for because the attacker doesn't know the input to the hash function. SipHash has not been cryptanalysed for your application of it in Cuckoo Cycle, wherein the attacker knows both the input and the output of the hash function.

Note that every submission that Dan Bernstein has made for a cryptographic hash has been rejected for having breakage. He appears not to excel in that one area. His ARX rounds (from the ChaCha symmetric cipher) have been redeployed in BLAKE2 by those who know how to make them cryptographically secure for the different application of a hash function. I admire Dan a lot. He knows a hell of a lot more than me about everything involving higher maths and cryptography.

In fact siphash-2-4 is already overkill for my purposes, with siphash-4-8 being close to cryptographically secure. Your attack is about as conceivable to me as P being equal to NP, in which case there is no cryptographically secure hash function anyway.

Fact? Based on what published cryptanalysis?

What I am saying is that the edge trimming eliminates most of the nodes from consideration and then you have a hash table representing a sparse array for the remainder of the nodes which aren't the pruned/trimmed leaf edges.

Yes, totally correct. But in my view the term "bucket" denotes a fixed size unordered container, and my current algorithms use no such thing.

Agreed, 'bucket' would not be an *ordered* array element. That is a more precise use of the term 'bucket'. We programmers who started hacking in the 1970s and 1980s may be a little looser with our terminology (employing the general English definition and using 'bucket' to refer to an array element, as in an array bucket) than someone who came up predominantly in an academic setting, and especially someone for whom technical English is not their native language, who would likely be stricter about definitions. I stand corrected on the more precise terminology.

Well my wink is about using maximum nonce counts, i.e. edge fraction `M/N > 1/2`.

My algorithms break down in that case, since the expected number of cycles becomes non-constant, causing the basic algorithm to overlook an increasing fraction of them, and edge-trimming fails altogether to remove a large fraction of edges.

Exactly.

In a few months, you'll know what I was referring to when it is published.

My calculation was 2^15, that isn't a million.

You need millions to find 2^12 of them wanting to access a single row.

With 2^15 threads you expect 1 access per row.

Correct on 2^15 but not on the conclusion. We both agree the number of memory banks is irrelevant for this concern. Memory banks only determine how many row buffers are active simultaneously, which impacts maximum speed and whether we are memory bandwidth or memory latency bound on speed of execution.

So for the power consumption issue, if we have 2^29 counters (2 bits each) and 2^14 counters per memory page/row, then there are 2^15 rows, and 2^16 (65536) h/w threads means we can stall threads (perhaps not permanently, so as to account for variance outliers) and be statistically assured of roughly 2 accesses per row. That already doubles the hash computation per memory page row.

Thus we don't need millions of h/w threads to make the algorithm computation bound on power consumption, which makes it not ASIC resistant.

At 2^18 (262144) h/w threads we have 8 times more computation per memory page row than the CPU (and the computation will be orders-of-magnitude more power efficient on the ASIC).
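The arithmetic behind these figures can be checked with a short sketch. It assumes only the numbers quoted in this post (2^29 counters, 2^14 counters per row) and that accesses are spread uniformly at random over the rows; the function name `accesses_per_row` is made up for illustration.

```python
# Figures taken from the post; uniform random access is an assumption.
COUNTERS = 2**29          # total number of 2-bit counters
COUNTERS_PER_ROW = 2**14  # counters that fit in one memory page/row


def accesses_per_row(hw_threads, counters=COUNTERS, per_row=COUNTERS_PER_ROW):
    """Mean number of outstanding accesses per row when every h/w thread
    has one access in flight and accesses are uniformly distributed."""
    rows = counters // per_row  # 2^29 / 2^14 = 2^15 rows
    return hw_threads / rows


if __name__ == "__main__":
    for exp in (15, 16, 18):
        print(f"2^{exp} threads -> {accesses_per_row(2**exp):g} accesses per row")
```

This gives 1, 2 and 8 accesses per row for 2^15, 2^16 and 2^18 threads respectively. It is only the mean; the variance around it is why stalling threads comes into the argument.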

A CUDA GPU can apparently do 671 million threads, although these are probably not synced on memory row buffers as we would need here. The statistical spread might be sufficient without syncing, though. I think if you had had the Kill-A-Watt meter, what you would have observed is that as you increased threads, the speed remained topped out at memory bandwidth, but the power consumption of the GPU decreased once threads were increased beyond a certain threshold (assuming GPU threads are very efficient, which might not be the case).

I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation, again sharing the computation transistors amongst only the threads that aren't stalled, and thus not needing millions of instances of the compute units. The GPU may already be doing this, although maybe not, since it is probably optimized for performance rather than power consumption where there is a choice. Once it is memory bandwidth bound, one might hope the GPU would minimize power consumption by maximizing row buffer coalescing, but this is perhaps too costly to implement in the general case of a GPU (though I am arguing it is not in the case of an ASIC with a specific custom computation circuit). Apparently GPU memory coalescing is not powerful enough to do what we require.

Also remember that for instant transactions I need a very fast proving time, which means that at 2^20 counters it could be trivially parallelized, losing ASIC resistance, with only thousands of threads.
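To put a rough number on that, here is a sketch of the same calculation for the small case. It assumes the 2^14 counters per row from earlier in this post still applies, and the 4096-thread figure is just an illustrative stand-in for "thousands of threads"; `rows_for` is a made-up helper name.

```python
def rows_for(counters, counters_per_row=2**14):
    """Number of memory rows needed to hold the counters
    (assumes the 2^14 counters-per-row figure used above)."""
    return counters // counters_per_row


if __name__ == "__main__":
    rows = rows_for(2**20)  # 2^20 / 2^14 = 64 rows
    threads = 4096          # hypothetical "thousands of threads"
    print(f"{rows} rows, {threads / rows:g} threads per row")
```

With only 64 rows to cover, a few thousand threads already put dozens of accesses on each row, compared with 2^15 rows in the 2^29-counter case.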

Actually, how ironic that using the 2-bit counters instead of the original basic algorithm makes the ASIC resistance worse, because more counters fit in a row. Remember I wrote about this "bit array" in my 2013 rough draft paper on memory hard proof-of-work.