you are not talking about a few bits of memory to store, but a lot more...
Yeah, I know. 128 kB (1,024,000 bits) per hash per core, if I remember correctly, which would require 1,024,000 flip-flops (plus some overhead for the various calculations that have to be done). Remember that memory usage only gets gigantic when you run multiple cores. Each modern ASIC chip has as many cores as the designers could fit on the die. As for GPUs, my 7970, for example, has 2048 cores. 2048 times 128 kB is just over 262 MB, which is right in the ballpark of how much GPU RAM it uses while hashing. Divide the hash rate (700 kH/s) by the number of cores (2048) to get hashes per second per core, multiply that by the number of memory accesses per hash (I forgot the number), then take the reciprocal (divide one by it), and I'd bet you would come up with a figure somewhere close to the memory latency of the GPU's RAM (googled for an hour, still can't find it).
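For anyone who wants to check that arithmetic, here is a rough Python sketch of it. The function names are made up for illustration, 128 kB is taken as 128,000 bytes to match the bit count above, and `accesses_per_hash` is left as a parameter because I don't have the real figure. It assumes every core works on one hash at a time and spends all of its time waiting on memory, which is a simplification.

```python
# Back-of-envelope numbers from the post; all names here are hypothetical.
BYTES_PER_CORE = 128_000  # assumption: 128 kB read as 128,000 bytes (1,024,000 bits)


def scratchpad_bytes(cores, bytes_per_core=BYTES_PER_CORE):
    """Total scratchpad memory when every core holds one hash's working set."""
    return cores * bytes_per_core


def seconds_per_hash_per_core(hashes_per_second, cores):
    """Time one core spends on one hash, assuming the work is spread evenly."""
    return cores / hashes_per_second


def seconds_per_access(hashes_per_second, cores, accesses_per_hash):
    """Implied time per memory access; accesses_per_hash must be supplied."""
    return seconds_per_hash_per_core(hashes_per_second, cores) / accesses_per_hash


if __name__ == "__main__":
    cores = 2048         # 7970 core count quoted above
    hash_rate = 700_000  # 700 kH/s quoted above
    print(f"bits per core: {BYTES_PER_CORE * 8:,}")
    print(f"total scratchpad: {scratchpad_bytes(cores) / 1e6:.3f} MB")
    print(f"time per hash per core: {seconds_per_hash_per_core(hash_rate, cores) * 1e3:.2f} ms")
```

Once you know the real number of memory accesses per hash, pass it to `seconds_per_access` and compare the result with the card's memory latency.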
But can you fit that many flip-flops on a chip, you might ask? Well, I figure, if you can fit it on an FPGA, you can fit it on an ASIC. Look at pages 10-11 of this user manual. According to that, you can get up to 2,443,000 flip-flops on an FPGA, and that's just that brand/series alone, so it can certainly be done on an ASIC.