You used a lot of doublespeak.
Nah, it's just you still having some trouble understanding.
First, I am aware of the space-time tradeoff; however, rather than explain it in every single post, it is useful to look at the max scratchpad size. A 128KB scratchpad is going to require less memory and less bandwidth than a 16MB scratchpad if everything else is the same.
Let's have a look at the definition of "memory-hard" in the scrypt paper:
"A memory-hard algorithm is thus an algorithm which asymptotically uses almost as many memory locations as it uses operations; it can also be thought of as an algorithm which comes close to using the most memory possible for a given number of operations, since by treating memory addresses as keys to a hash table it is trivial to limit a Random Access Machine to an address space proportional to its running time",
"Theorem 2. The function SMixr(B, N) can be computed in 4 * N * r applications of the Salsa20/8 core using 1024 * N * r + O(r) bits of storage"You can see that scrypt is just equally memory hard for all the scratchpad sizes. The ratio between the number of scratchpad access operations and the number of salsa20/8 calculations remains the same.
As for a higher parameter value having no effect on the relative performance of CPU, GPU, and FPGA/ASICs, that is just false.
What I said was "Increasing the size of the scratchpad is not going to bring any improvements (if by improvements you mean making CPU mining more competitive)". How did it turn into "no effect on the relative performance of CPU, GPU, and FPGA/ASICs"? CPU miners are at a serious disadvantage right now, so the effect on the relative performance must be really significant in favour of the CPU in order to turn the tables.
In practice, increasing the size of the scratchpad will make it harder to fit in the CPU caches. To mitigate the unwanted latency of random accesses, scrypt uses the parameter 'r'. Basically, if r=1 (the value used by LTC), then the scratchpad is accessed as 128-byte chunks at random locations. If r=8, then the memory accesses are done as 1024-byte chunks at random locations. In the former case, the cache miss penalty is paid once per 128 bytes. In the latter case, the cache miss penalty is paid once per 1024 bytes (the sequential accesses after the first cache miss are automatically prefetched, at least in theory). Having a high 'r' value reduces the effect of the memory access latency penalty for the CPU. And latency is not an issue for the GPU in the first place.

Additionally, if the CPU has to go out to memory, then the memory controller must have enough bandwidth. For example, my Core i7 860 processor currently gets ~29 kHash/s in cpuminer. And the STREAM benchmark (built as multithreaded with OpenMP support) shows ~10 GB/s of practically available memory bandwidth. These ~10 GB/s of memory bandwidth would translate to a hard theoretical limit of ~38 kHash/s on hashing speed if the CPU caches were not helping. There is not much headroom as far as I can see, and my processor does not even have AVX2.
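For anyone who wants to check where the ~38 kHash/s figure comes from, here is the rough estimate I used, assuming each hash has to write its 128 KiB scratchpad once and then read all of it back, with zero help from the caches:

```c
/* Rough estimate, assuming each scrypt hash (N=1024, r=1) writes its
 * 128 KiB scratchpad once and then reads all of it back in random order,
 * with no help from the CPU caches. */
#include <stdio.h>

int main(void)
{
    double bandwidth = 10e9;               /* ~10 GB/s measured by STREAM    */
    double per_hash  = 2.0 * 128 * 1024;   /* 128 KiB written + 128 KiB read */
    printf("bandwidth-limited rate: ~%.0f kHash/s\n",
           bandwidth / per_hash / 1000.0); /* prints ~38 */
    return 0;
}
```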
Scrypt was designed to be GPU and specialized device resistant. This is important in password hashing, as most servers use CPUs and an attacker will likely choose the most effective component for brute forcing. By keeping CPU performance superior, it prevents attackers from gaining an advantage. You can test this yourself. Modify the cgminer OpenCL kernel to use a higher p value. Around 2^14 the GPU's relative performance is essentially gone; it is comparable to CPU throughput. At 2^16 the GPU falls far behind.
This is confusing; did you actually mean the 'N' value? Please just provide the patches for your changes to cgminer and cpuminer that you used for this comparison.
But in general, GPU tuning is not easy because there are many parameters to tweak. A poorly selected configuration can result in poor hashing performance even for LTC scrypt. You can find many requests for help with the configuration on the forum. So your poor performance report does not mean anything.
At 2^20 the GPU never completes.
And surely you can raise the memory requirements so high that they would make mining problematic on the current generation of video cards purely thanks to an insufficient amount of GDDR5 memory. But guess what? In a year or so, the next generation of video cards will have more memory, and suddenly GPU mining will again be seriously better than CPU mining. Designing the algorithm around some magic limits, which may become ineffective at any time, is not the best idea. The current "small" scratchpad size for scrypt focuses on memory bandwidth instead of relying on artificial limits such as memory size (which can be easily increased, especially in custom-built devices).
You say on one hand that the memory requirement doesn't matter, and on the other hand that FPGAs are hard because they need lots of memory and wide buses. Well, guess what: the higher the p value, the MORE memory and the wider the buses that are needed. At 2^14, roughly 128x the max scratchpad size is going to mean roughly 128x as much bandwidth is necessary.
Yes, but only if also backed by roughly 128x more computational power. And likewise, the enormous computational power of FPGA/ASIC must be backed by a lot of memory bandwidth, otherwise it will be wasted.
So the lower the p value, the EASIER the job is for FPGA and ASIC builders. They can use less memory and narrower buses, which means less cost, less complexity, and a higher ROI. Sure, one isn't required to use the max scratchpad size because one can compute values on the fly, but once again the whole point of the space-time tradeoff is that the advantage of doing so is reduced.
They can't use slower external memory, because it already needs to be damn fast.
Lastly, yes, the 128KB is per core, but so is the 16MB using the default parameters. If 128KB per core increases memory, bandwidth, and/or die size per core, then a 16MB requirement would make it even harder.
Yes, the absolute hashing speed would just drop significantly with the 16MB scratchpad. But it would drop on CPU, GPU, FPGA or any other kind of mining device.
So yes, the parameters chosen by LTC make it 128x less memory-hard than the default.
Sigh. Please just read the definition of "memory-hard" in the scrypt paper.
You use circular logic to say the max scratchpad size is irrelevant because one can optimize the size of the scratchpad to available resources. This doesn't change the fact that due to the space-time tradeoff you aren't gaining relative performance. Using a higher max scratchpad requires either more memory and bandwidth OR more computation. The throughput on the FPGA, GPU, and CPU is going to be reduced. Now, if they were all reduced equally it wouldn't matter; all that matters is relative, not nominal, performance.
Wait a second, where does this "reduced equally" come from? The space-time tradeoff just means that if you have a system with excessive computational power but slow memory, then you can still tweak lookup-gap to trade one for the other. That is, instead of being at a huge disadvantage compared to a more optimally balanced system. This kinda "equalizes" systems with vastly different specs, which is the total opposite of "reduced equally".
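To make it clear what the lookup-gap knob actually does, here is a minimal toy sketch. This is only the indexing idea, not the real scrypt code or the cgminer kernel: the 128-byte blocks and BlockMix are replaced by a single 32-bit word and a dummy mix() function, and G=4 is an arbitrary example value. Only every G-th entry of the V array is stored, and the skipped states are recomputed on the fly:

```c
/* Toy illustration of the lookup-gap time-memory tradeoff (G = lookup gap).
 * The real scrypt blocks and BlockMix are replaced by a single uint32_t and
 * a dummy mix() function; only the V-array indexing scheme is the point. */
#include <stdio.h>
#include <stdint.h>

#define N 1024   /* scrypt N (the LTC value) */
#define G 4      /* lookup gap: keep every 4th V entry -> 1/4 of the memory */

static uint32_t mix(uint32_t x)          /* stand-in for BlockMix/Salsa20/8 */
{
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return x;
}

int main(void)
{
    static uint32_t V[N / G];            /* only N/G entries are stored */
    uint32_t x = 0x12345678u;

    /* First loop: walk the chain, but store only every G-th state. */
    for (int i = 0; i < N; i++) {
        if (i % G == 0)
            V[i / G] = x;
        x = mix(x);
    }

    /* Second loop: for a "random" index j, load the nearest stored entry
     * and recompute the (j % G) missing steps on the fly. */
    for (int i = 0; i < N; i++) {
        uint32_t j = x % N;              /* pseudo-random lookup index */
        uint32_t v = V[j / G];
        for (uint32_t k = 0; k < j % G; k++)
            v = mix(v);                  /* extra computation replaces memory */
        x = mix(x ^ v);
    }

    printf("result: %08x, memory used: %u bytes instead of %u\n",
           (unsigned)x, (unsigned)sizeof(V), (unsigned)(N * sizeof(uint32_t)));
    return 0;
}
```

A device that is short on memory but has computation to spare can raise G; a CPU with big caches keeps G=1. That is the tradeoff being "equalized", not some uniform slowdown.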
However, the LTC parameters chosen are horrible for CPU usage. CPUs have a limited ability for parallel execution, usually 4 or 8 independent cores. 128KB per core * 8 = 1MB.
This just means that you don't know much about CPU mining. The point is that modern superscalar processors can execute more than one instruction per cycle; this is called instruction-level parallelism. Also, there are instruction latencies to take care of. In order to fully utilize the CPU pipeline, each thread has to calculate multiple independent hashes in parallel. Right now cpuminer calculates 3 hashes at once per thread (or even 6 with AVX2). Now do the math.
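(To spell it out with your own numbers: 8 threads * 3 hashes * 128 KiB is already 3 MiB of scratchpads in flight, or 6 MiB with AVX2, which is far more than the per-core L2 caches.)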
That's right: today, with systems where multiple GB of memory can be installed at very cheap cost, the scrypt parameters chosen bottleneck performance on a CPU. GPUs, on the other hand, are highly parallel execution engines, but they have limited memory, and that memory comes at a higher cost than what a CPU has access to.
The memory must also be fast, not just large.
TL;DR
For the external memory, I'm assuming that sufficient size is available to be used for as many cores as practically useful (in this case only the memory bandwidth is an important factor). For the on-chip SRAM memory, the bandwidth should not be a problem, as the memory can be tightly coupled with each scrypt core, but the size does matter and can't be made large enough (the CPU caches are really small compared with DDR memory modules for a reason). The current best-performing scrypt mining devices (AMD video cards) rely on external memory bandwidth. This FPGA design seems to be essentially a GPU clone.