You need millions of threads to find 2^12 of them wanting to access a single row.
With 2^15 threads you expect 1 access per row.
Correct on 2^15 but not on the conclusion.
With 2^27 you expect 2^12 accesses per row. That's well over a hundred million threads.
That would give you 2^12 hashes per row buffer load, but that many isn't necessary to break ASIC resistance. 2^18 (256K threads) gives 8 hashes per row buffer load, which might be enough to make it computation bound, up to a power-consumption improvement on the ASIC in the range of perhaps 100-to-1.
Or in any case, 2^20 (a million threads) gives 32 hashes per row buffer load.
And remember my use case is 2^20 counters, not 2^29, so in my use case the ASIC only needs thousands of threads to make it computation bound, up to the limits of the ASIC's maximum efficiency advantage.
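Here is a minimal Python sketch of that arithmetic. It assumes things not stated outright above: counter accesses are uniformly random, the 2^29-counter table spans 2^15 DRAM rows (the figure implied by "2^15 threads gives 1 access per row"), and the 2^20-counter table keeps the same counters-per-row density, hence roughly 2^6 rows.

```python
# Minimal sketch of the hashes-per-row-buffer-load arithmetic above.
# Assumptions (not stated outright in the thread): counter accesses are
# uniformly random, the 2^29-counter table spans 2^15 DRAM rows (implied by
# "2^15 threads -> 1 access per row"), and the 2^20-counter table keeps the
# same counters-per-row density, i.e. about 2^6 rows.

def expected_hashes_per_row_load(num_threads: int, num_rows: int) -> float:
    """Expected number of coalesced accesses landing in one open row buffer."""
    return num_threads / num_rows

# 2^29-counter case (assumed 2^15 rows)
for threads in (2**15, 2**18, 2**20, 2**27):
    h = expected_hashes_per_row_load(threads, 2**15)
    print(f"2^29 counters, {threads:>9} threads: {h:6.0f} hashes per row buffer load")

# 2^20-counter case (assumed 2^6 rows): mere thousands of threads already
# give many hashes per row buffer load.
for threads in (2**10, 2**12):
    h = expected_hashes_per_row_load(threads, 2**6)
    print(f"2^20 counters, {threads:>9} threads: {h:6.0f} hashes per row buffer load")
```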
A CUDA GPU can apparently do 671 million threads.
Not in hardware. It can only run a few thousand in hardware (simultaneously).
So the 671 million have to run in thousands of batches, one after the other.
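A quick sketch of that batching arithmetic follows; the hardware-resident thread counts are placeholders for "a few thousand in hardware", and the real figure depends on the GPU, so the batch counts are illustrative only.

```python
# Quick sketch of the batching point. The hardware-resident thread counts are
# placeholders for "a few thousand in hardware"; the real figure depends on the
# GPU, so the batch counts below are illustrative only.
import math

logical_threads = 671_000_000  # the "671 million" figure above

for resident in (2_048, 8_192, 65_536):  # hypothetical resident-thread capacities
    batches = math.ceil(logical_threads / resident)
    print(f"{resident:>6} resident threads -> {batches:>7} sequential batches")
```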
Yeah, I also stated in the prior post that the GPU isn't likely optimized enough to give a major advantage. The ASIC is the big concern.
I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation.
Your ASIC would require an insanely huge die and be orders of magnitude more expensive than the DRAM it tries to optimize access to.
Certainly not in my use case of 2^20 counters.
And in the 2^29 counters use case, please remember I said that you don't need millions of compute units. The compute units can be shared by the non-stalled threads, and even these can be pipelined and queued. So you don't need millions of compute-unit circuits; the millions are only logical threads.
And again, the computation transistors are shared amongst only the threads that aren't stalled, so millions of instances of the compute units aren't needed.
How would you know what to stall without doing its computation first??
That is irrelevant. Think it out: it is a statistical phenomenon. Not all the computations are computed monolithically, followed by all the syncs. These stages are happening in parallel across 2^8 memory banks (row buffers), so we can have many threads stalled yet some not, and even the computations can be pipelined and queued. Let's not conflate the power-consumption goal with the speed goal.
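To illustrate the statistical point, here is a toy Python simulation; the thread count, stall latency, and hash-cycle count are illustrative placeholders, not figures from the actual design. Even though each individual thread is stalled most of the time, only a small fraction of the logical threads are computing at any instant, so a shared pool of compute units far smaller than the thread count keeps up.

```python
# Toy simulation of the "share compute units among non-stalled threads" point.
# The thread count, stall latency, and hash-cycle count are illustrative
# placeholders, not parameters of the actual design. Each logical thread
# alternates between stalling on a memory access and spending a couple of
# cycles hashing; we count how many threads are actually computing per cycle.
import random

num_threads    = 4_096   # logical threads (placeholder)
mem_latency    = 64      # stall cycles per memory access (placeholder)
compute_cycles = 2       # hashing cycles per access (placeholder)
sim_cycles     = 1_000

# Each thread is a (phase, cycles_remaining) pair; start at random stall offsets.
threads = [("stall", random.randrange(1, mem_latency + 1)) for _ in range(num_threads)]
busy_counts = []

for _ in range(sim_cycles):
    busy = 0
    next_threads = []
    for phase, remaining in threads:
        if phase == "compute":
            busy += 1              # this thread occupies a compute unit this cycle
        remaining -= 1
        if remaining == 0:
            # After computing, stall on the next memory access; after the stall
            # resolves, hash the fetched data.
            phase, remaining = (("stall", mem_latency) if phase == "compute"
                                else ("compute", compute_cycles))
        next_threads.append((phase, remaining))
    threads = next_threads
    busy_counts.append(busy)

avg = sum(busy_counts) / sim_cycles
# Steady-state expectation: num_threads * compute_cycles / (compute_cycles + mem_latency),
# i.e. roughly 124 compute units for 4096 logical threads with these placeholders.
print(f"compute units busy per cycle: avg {avg:.0f}, peak {max(busy_counts)}, "
      f"out of {num_threads} logical threads")
```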
It is very very difficult to defeat the ASIC. I think I know how to do it though.
Also remember that for instant transactions I need a very fast proving time, which means at 2^20 counters it could be trivially parallelized, losing ASIC resistance with only thousands of threads.
For such small instances, latency is not that relevant as it all fits in SRAM cache.
Latency is a speed-bound concern. Our concern herein is the power-consumption bound. That is what I am getting you to focus on; these are orthogonal issues.
The ASIC can choose to use DRAM instead and amortize the power consumption over the row buffer, so I am not yet sure if that would give the ASIC an advantage. Although I've read that SRAM is 10X faster at 10X the power consumption (thus roughly neutral relative to DRAM), I am wondering whether, if the ASIC can coalesce (sync) all the reads within a row buffer, the DRAM might not in fact have a significant power advantage, thus rendering it computation bound for power consumption.
Do you have any idea? I was planning to study the equations for memory power consumption in the reference I cited upthread.
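Here is the kind of back-of-the-envelope comparison I have in mind, as a Python sketch; every energy figure below is a made-up placeholder to be replaced with real values from that reference. The model simply amortizes the row-activation energy over however many hashes are coalesced into one row-buffer load and compares the per-hash memory energy against the per-hash computation energy.

```python
# Back-of-the-envelope sketch of the DRAM-vs-SRAM power question above.
# ALL energy figures are made-up placeholders, to be replaced with real values
# from the memory power-consumption reference cited upthread. The model simply
# amortizes the row-activation energy over the hashes coalesced into one row
# buffer load and compares per-hash memory energy with per-hash compute energy.

def dram_energy_per_hash(hashes_per_row_load: int,
                         row_activation_pj: float,
                         column_access_pj: float) -> float:
    """Row-activation cost shared by the coalesced hashes, plus one column access each."""
    return row_activation_pj / hashes_per_row_load + column_access_pj

row_act_pj  = 1000.0   # placeholder: energy per DRAM row activation (pJ)
col_acc_pj  = 50.0     # placeholder: energy per DRAM column access (pJ)
sram_acc_pj = 100.0    # placeholder: energy per SRAM access (pJ)
hash_pj     = 500.0    # placeholder: energy per hash computation (pJ)

for k in (1, 8, 32, 4096):
    mem = dram_energy_per_hash(k, row_act_pj, col_acc_pj)
    bound = "memory" if mem > hash_pj else "computation"
    print(f"{k:>5} hashes/row load: DRAM {mem:7.1f} pJ/hash, SRAM {sram_acc_pj:.1f} pJ/hash, "
          f"hash {hash_pj:.1f} pJ -> {bound}-bound for power")
```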
Note I think on Android it may be possible to turn off the cache.