And it was seen to be impossible (by the design of the algorithm) to have an ASIC
which doesn't use a lot of very fast memory (or am I wrong about that, TheRealSteve? My GPUs
make heavy use of the graphics memory while mining; I don't believe they're using only 128kB there...)
Keep in mind that the memory requirement (at least in the context of Litecoin) is per hash. So while your card is using a lot of memory, that's only because it's working on many hashes at the same time (with diminishing returns).

At the same time, the GPU has to communicate with that memory off-die: out of the chip, through the leads, through the memory controller, through more leads, into the GDDR chips (often through another controller), onto the RAM, and back. With an ASIC, you could design in as much memory as you're willing to drop on there, very close to where you're actually doing the computation. There are quite impressive speed gains to be had there.

Compare it to the various levels of cache on a CPU vs RAM. An L1 cache hit is 3 or 4 cycles - divide 1 second by the core frequency to get the cycle time, then multiply by 3 or 4 to see how long a hit takes. A nice round 2.5GHz clock gives a 0.4 nanosecond cycle, so an L1 hit takes roughly 1.2-1.6 nanoseconds. RAM takes many times longer (tens of nanoseconds at best - still fast, but obviously quite a bit slower than the L1 cache).
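To put rough numbers on both points, here's a quick back-of-the-envelope sketch in Python. The 128kB-per-hash figure is scrypt's scratchpad size for Litecoin's parameters (N=1024, r=1); the clock speed and DRAM latency are just the ballpark values from above, not measurements of any particular chip.

# Napkin math for the two points above. Assumes Litecoin's scrypt
# parameters (N=1024, r=1) and the ballpark latencies from the text.

SCRATCHPAD_BYTES = 128 * 1024 * 1   # scrypt scratchpad: 128 * N * r bytes = 128kB per hash

def total_memory_kb(concurrent_hashes):
    # Memory needed to work on this many hashes at once.
    return concurrent_hashes * SCRATCHPAD_BYTES / 1024

def l1_hit_ns(clock_ghz, cycles=4):
    # L1 access time: cycle time (1 / clock) times the cycle count.
    cycle_ns = 1.0 / clock_ghz      # 2.5GHz -> 0.4ns per cycle
    return cycles * cycle_ns

print(total_memory_kb(1))           # 128.0 kB for a single hash
print(total_memory_kb(8192))        # 1048576.0 kB (~1GB) - why a GPU eats its VRAM
print(l1_hit_ns(2.5))               # 1.6 ns for an L1 hit at 2.5GHz
print(50 / l1_hit_ns(2.5))          # ~31x - how much slower ~50ns DRAM is than that L1 hit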
GPUs also have a bit of such localized memory, just not very much of it. I'm guessing that'll change in the future - the 7970 already has a shiny 768kB of L2 cache (it's not entirely equivalent to the cache levels on a CPU; suffice it to say it's way faster than the GPU sending bits back and forth to the off-chip, on-card memory).
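For a feel for the sizes involved, here's the same kind of napkin math for how many complete 128kB scrypt scratchpads fit in a given amount of on-die memory. Note that a GPU's L2 isn't necessarily usable as scratchpad storage, and the 8MB ASIC is a made-up example - this is just about orders of magnitude.

# How many full 128kB scrypt scratchpads fit in a given on-die memory?
# Illustrative only: a GPU's L2 isn't necessarily usable this way, and
# the 8MB ASIC figure is hypothetical.

SCRATCHPAD_KB = 128

def scratchpads_that_fit(on_die_kb):
    return on_die_kb // SCRATCHPAD_KB

print(scratchpads_that_fit(768))        # 6 - a 7970's 768kB L2
print(scratchpads_that_fit(8 * 1024))   # 64 - a hypothetical ASIC with 8MB of on-die SRAM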
So even though an ASIC would have less memory and less parallelism than a GPU, the processing itself can be done much faster.
Mind you, I'm speaking theoretically - for all I know, GridSeed found some actual optimizations as well.