Thanks; tc seems clear to me now. Do we use multiple calculations per shader (tc higher than the shader count) because we're trying to get lucky, hoping that a single bulk memory transfer delivers multiple useful values?
But what exactly does lookup gap do?
I can't believe it's easier to regenerate the scratchpad for lookups than to wait for the memory access (I know random memory access takes extremely long, but still, regenerate the scratchpad??)
Several threads per Stream Processor, apparently.
I'm just guessing: the GPU could work faster, but VRAM random-access speed is the limiting factor
(the entire VRAM is used after all, and lookup gap > 1 means you have to write+read several times
instead of once).
The performance hit from waiting on twice as much memory (HyperMemory, or regular RAM instead of cache)
would be worse than the lookup-gap hit: double the lookup gap and you can always halve RAM use for roughly a 50% performance hit.
This would be the same for hypothetical ASICs and FPGAs.
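To make the lookup-gap trade-off concrete, here's a minimal sketch of scrypt's ROMix loop with and without a gap. It uses SHA-256 as a toy stand-in for scrypt's BlockMix (an assumption for brevity; the real mix function is Salsa20/8-based), and the function names `romix_full`/`romix_gap` are mine, not from any miner's source:

```python
import hashlib

def H(x: bytes) -> bytes:
    # toy stand-in for scrypt's BlockMix (assumption; real scrypt uses Salsa20/8)
    return hashlib.sha256(x).digest()

def romix_full(seed: bytes, N: int) -> bytes:
    # classic ROMix: store all N scratchpad entries
    V = []
    X = seed
    for _ in range(N):
        V.append(X)
        X = H(X)
    for _ in range(N):
        j = int.from_bytes(X[:4], "little") % N
        X = H(bytes(a ^ b for a, b in zip(X, V[j])))
    return X

def romix_gap(seed: bytes, N: int, gap: int) -> bytes:
    # lookup-gap variant: store only every `gap`-th entry (~N/gap memory),
    # regenerate missing entries on demand (up to gap-1 extra H calls per lookup)
    V = {}
    X = seed
    for i in range(N):
        if i % gap == 0:
            V[i] = X
        X = H(X)  # note V[i+1] == H(V[i]), which is what lets us regenerate
    for _ in range(N):
        j = int.from_bytes(X[:4], "little") % N
        Vj = V[j - j % gap]
        for _ in range(j % gap):
            Vj = H(Vj)  # recompute forward from the nearest stored entry
        X = H(bytes(a ^ b for a, b in zip(X, Vj)))
    return X

# both give identical results; the gap version trades compute for memory
assert romix_full(b"\x00" * 32, 64) == romix_gap(b"\x00" * 32, 64, 2)
```

On average each lookup costs (gap-1)/2 extra hash calls, which is why halving memory costs "only" ~50% performance rather than being a disaster.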
CPU is more interesting. LTC's 128KB scrypt implementation fits in L2 cache, but L3 cache is almost as fast.
http://www.xbitlabs.com/articles/cpu/display/core-i7-3770k-i5-3570k_2.html
http://www.sisoftware.net/?d=qa&f=gpu_mem_latency&l=fr&a=i7
8MB L3 cache vs. normal DDR3 RAM: ~10 times faster (latency ~4ns / ~40ns)
HD6850 VRAM vs. Llano shared DDR3 RAM: <2 times faster (random access pattern test, 703ns / 1110ns)
HD6850 256kB L2 cache vs. HD6850 VRAM: 2 times faster (365ns / 703ns)
So, a 1 or 2MB lookup table would be the "sweet spot" for most CPUs? Even Celerons have 2MB L3 cache...
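The latency cliffs above can be seen with a pointer-chasing microbenchmark, where each read depends on the previous one so the hardware prefetcher can't help. A rough sketch (interpreter overhead dominates the absolute numbers in Python, so only the relative jump as the table outgrows cache is meaningful; `chase_latency` is my own helper, not a standard tool):

```python
import random
import time

def chase_latency(size_bytes: int, reads: int = 1_000_000) -> float:
    """Return approximate ns per dependent random read over a table of size_bytes."""
    n = size_bytes // 8  # assume roughly 8 bytes per list slot
    perm = list(range(n))
    random.shuffle(perm)
    # build a single random cycle: nxt[j] points to the next index to visit
    nxt = [0] * n
    j = 0
    for i in perm:
        nxt[j] = i
        j = i
    nxt[j] = perm[0]
    t0 = time.perf_counter()
    j = 0
    for _ in range(reads):
        j = nxt[j]  # each iteration stalls on the previous read
    return (time.perf_counter() - t0) / reads * 1e9

# expect a visible step up in ns/read once the table no longer fits in L2/L3,
# mirroring the cache-vs-DRAM latency gap cited above
for mb in (1, 2, 16):
    print(mb, "MB:", round(chase_latency(mb * 1024 * 1024, 200_000), 1), "ns/read")
```

A C version with a raw `uint64_t` array would give numbers comparable to the SiSoftware figures quoted above; the Python sketch only shows the shape of the curve.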