When I'm done writing a GPU implementation, I'll be quoting performance in solutions/second.
Sure, that makes most sense.
But as Jonathan Toomim pointed out in the Zcash forum,
and as SolarDiz agreed with
Sol/s and Sol/(GiB•s) are two distinct metrics. Both are important.
I simply chose the somewhat less common measure because
1) for a memory-hard algorithm, it's crucial that you cannot trade off memory for speed without penalty.
2) i didn't want to disclose the precise Sol/s performance of my miner.
As for space for indexes, 21 bits isn't that significant. After the first sort round, you need 2*21 bits + 180 bits hash = 222 bits. After the 2nd round, you need 4*21 bits + 160 bits hash = 244 bits, which fits nicely in 32-byte fixed size records. By the 3rd round the size of the indexes overtakes the hash size, but the number of hashes diminishes.
And by the 8th round you need 256*21 bits + 40 bits =5416 bits = 677 bytes.
That's where you need to get creative.
edit: here's a paper by DJB that discusses optimizing Wagner's algorithm.
I'm familiar with some of DJB's papers on wagner's algorithm, although not that particular one.