on indexing random data

In IRC, we were talking about new index formats for the block chain. Since most of the things we want to index by are random, this gets hairy.

I'm thinking of a tree of lists structure. It should have pretty good performance even in the worst case, and be pretty safe under IO failures.

In practice, this is a collection of 4k chunks. Each chunk starts with a header of some sort, probably very basic. At a minimum, each chunk needs to indicate what type of chunk it is, and a chunk ID will make it very easy to keep a journal (see below). 4k was chosen because it is the native block size of modern hard drives. It is also the native page size for virtual memory in lots of CPUs. Memory mapped file IO should be FAST, not counting syncs.

The first chunk will be a tree chunk. After the header, it has 64 pointers, each one 32 bits long. These pointers are to chunk numbers inside the index file. The first pointer is for the leftmost branch of a binary tree, corresponding to 000000, the second is 000001, the third is 000010, etc. Lookups are very fast, simply isolate the 6 bits that are important, multiply by 4 and add the header size. This could also be a different chunk type, a "root chunk" with extra metadata, but would mostly act like a tree chunk.

If a pointer is to another tree chunk (identified by the header) that subtree covers the next 6 bits, etc, etc. In the block chain example, this means that there will be 5 mostly empty tree chunks, and the tree will be somewhat unbalanced. The waste shouldn't be too bad, and a special case could skip through them. If using this system for keys (or really anything but block header hashes) it will be more balanced.

If the pointer is to a list chunk, that chunk has a list of not more than 62 or 63 pointers (it depends on the size of the header) into the full block structure, each pointer is 64 bits long, just a byte offset. Lists aren't necessarily sorted, but by the time you get there, it is just a matter of a few dozen comparisons.

Any time an insert would cause a list chunk to hit a high water mark, it triggers a reorg. 64 new list chunks and a new tree chunk are written and items from the old list chunk are distributed among the new lists. Once those are done and synced, the parent tree chunk would be overwritten with a pointer to the new tree chunk. Then the old chunk is marked available. Since chunks are all the same size, the structure won't fragment. Worst case is a crash just before updating the old chunk, which would leave 65 orphans. Garbage collection should be trivial though, and the free chunk list easy to maintain.

For extra safety, writes could be done with a journal chunk or two, for example, reserve chunk #2 as the permanent journal, and any time you need to write, write there first, and then write the real chunk. For performance, you'd probably want dozens or hundreds of journal chunks. If they have IDs, any time you open the file, you can just replay anything you find there. Zero the whole journal when you are done with it so that future writes can just start at the beginning of the journal and go in (write) order.

This system could be extended with a recent index buffer in addition to the main index. This would be a third chunk type, but basically just a list chunk. The difference would be a pointer back to the previous buffer chunk. The buffer would be checked first, and the size of the buffer could be dynamic. If your node mostly looks up recent blocks, most lookups will be buffer hits, so keep more buffer chunks. If you get mostly buffer misses, you are wasting your time checking it, so keep fewer buffer chunks.

When the system is IO bound, the buffer could even have other uses. For example, when used for the blockchain, during initial downloads, the buffer chain could be allowed to grow to unusual size, and kept mostly in memory, with completed chunks written only when full. In a crash, only the most recent few dozen indexes would be lost. When the node was caught up, it could then replay the buffers starting from the oldest chunk, and only freeing them after their contents were fully committed to the main structure. This might cause a hole in the structure, but only once, and defragmentation would be easy.