Bitcoin Forum
Author Topic: P2P database storage  (Read 1281 times)
TierNolan (OP)
Legendary

Activity: 1232
Merit: 1094


February 18, 2013, 01:22:18 PM
 #1

Summary

- All network data storage is a map of the form hash(data) -> data
- This can be easily verified (so database bloat can't occur)
- Have each node select a random non-unique id (say 2 bytes)
- Nodes store only data where hash(data) has the same msb's as the client id
- Nodes can choose how many bits to use, to control how much storage they provide
- Create routing tables based on the client ids

As long as there are enough nodes, all entries will be stored somewhere, and the routing tables allow all requests to be routed.

There needs to be a way to notice that records are missing.

For example, you can't prove that the last block added to the chain is invalid if it contains unknown transactions.

Details

It is assumed that all nodes have the entire block header chain.  This should also include recent forks, with some cutoff (say 1024 from the head of the tree).

Block headers are therefore special: you just send the header, and it is added to the chain if it meets the simple rules.

The database is a map with keys of "hash(message)" mapping to "message".
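The self-verifying property from the summary follows directly from this key scheme. A minimal sketch in Python (assuming SHA-256 as the hash; the post does not fix a specific hash function, and `store`/`audit` are hypothetical helper names):

```python
import hashlib

def store(db, data):
    """Insert data under its own hash and return the key."""
    key = hashlib.sha256(data).digest()
    db[key] = data
    return key

def audit(db):
    """Database bloat with bogus records is detectable:
    every key must equal the hash of its value."""
    return all(hashlib.sha256(v).digest() == k for k, v in db.items())
```

Any node (or any peer receiving a record) can run the same check on a single entry before accepting it, so invalid records never need to be stored.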

Merkle nodes all need to be stored and accessible.  Ultimately, they should be stored with the transactions, but they need to be obtainable directly too.

Each node selects a random id that is N bits long.  The longer the id, the less data the node is claiming to have. 

If the top N bits of a record's key match a node's id, then the record should be stored by the node.  It can add an extra bit to its id if the data stored exceeds what it is willing to support, and then drop the records that no longer match.

For simplicity, I will assume that all nodes have 2 bytes as their ids, but it can be made general.

Node Id: 0x1200
Mask: 0xFF00

means that the node is willing to store data for all ids of the form 0x12xy (for any x, y).  As long as the non-zero part of the id is short, there will be many nodes with the same id.
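With the 2-byte ids above, deciding whether a record belongs to a node is just a masked comparison on the top bytes of the key. A sketch (the `should_store` name and the 32-byte key length are illustrative assumptions):

```python
NODE_ID = 0x1200
MASK = 0xFF00

def should_store(key, node_id=NODE_ID, mask=MASK):
    """A node stores a record iff the top 16 bits of the record's
    key, under the node's mask, equal the node's id."""
    top16 = int.from_bytes(key[:2], "big")
    return (top16 & mask) == (node_id & mask)
```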

A node then builds up a routing table with entries based on distance to the other nodes.  This is the magnitude of the signed (wraparound) difference between the two nodes' ids.  Thus the max difference is 32768 (0x8000).
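The wraparound distance works out to (a sketch, for 16-bit ids):

```python
def id_distance(a, b):
    """Magnitude of the signed 16-bit wraparound difference
    between two node ids; the maximum possible value is 0x8000."""
    d = (a - b) & 0xFFFF
    return min(d, 0x10000 - d)
```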

A compressed routing table can be sent with 34 entries.

Diff <= 0x8000
Diff <= 0x6000
Diff <= 0x4000
...
Diff <= 0x0004
Diff <= 0x0003
Diff <= 0x0002
Diff <= 0x0001 (rounding error)
Diff <= 0x0001
Diff <= 0x0000 (always zero)

This table is 34 entries.  The end of the table can be excluded once you hit zero hops, as the rest of the table is guaranteed to be all zeros.  This would happen if the node decided to use a shorter id.  This means table updates only take 34 bytes (or 68 if more than 255 hops need to be supported).  It is probably easiest to make the table an array of var_ints, which would mostly be a byte per entry.

Each threshold in the table reduces by approx sqrt(2) from the previous one.
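Generating the bucket thresholds can be sketched as follows. The exact rounding rule is a free choice (and is why the tail of the original table has a "rounding error" duplicate), so the precise entry count can come out 33 or 34 depending on how you round; `bucket_thresholds` is a hypothetical name:

```python
import math

def bucket_thresholds(max_diff=0x8000):
    """Distance buckets for the compressed routing table: each
    threshold is roughly sqrt(2) smaller than the last, down to 1,
    plus a final always-zero bucket."""
    thresholds = []
    t = float(max_diff)
    while round(t) >= 1:
        thresholds.append(round(t))
        t /= math.sqrt(2)
    thresholds.append(0)  # the "always zero" bucket
    return thresholds
```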

When a node receives a routing update, it can then update its own table.

If one of its local ranges is a subset of the remote range, then it can update the local range to min(old_local_hops, remote_hops + 1)

The local node could maintain a local distance to all ids, and would update each id entry one at a time.

If this caused its own "output" routing table to change, then it could update its neighbors.

This allows requests for data to be routed to the appropriate node.
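The update rule above is ordinary distance-vector routing applied per bucket. A sketch of merging one remote table, using 255 as the "unreachable" hop count mentioned later in the post (`merge_remote_table` is an illustrative name):

```python
def merge_remote_table(local, remote):
    """Apply min(old_local_hops, remote_hops + 1) bucket by bucket.
    Returns the merged table and whether anything changed, so the
    node knows to re-broadcast its own table to its neighbours."""
    changed = False
    merged = list(local)
    for i, r in enumerate(remote):
        candidate = min(r + 1, 255)
        if candidate < merged[i]:
            merged[i] = candidate
            changed = True
    return merged, changed
```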

When sending data to be stored, the receiver could check the proof as the data is received, and break the connection if the proof is invalid.  A hostile node with the proof could keep re-sending the data (assuming it switched IP addresses).  However, that only works until the data is stored throughout the network.

For keys which lead to other keys, it is the responsibility of the nodes which store them to make sure the keys they point to are generally available.

For example, if a node stored a record

key-hash -> hash(child1, child2)

It would have to make sure that child1 and child2 were available.  This should be done when the entries are added to its database, and periodically afterwards.
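The periodic availability check for such a parent record might look like this sketch (assuming the stored value is the concatenation of 32-byte child hashes, and that `missing_children` is a hypothetical helper; in the real network each missing key would trigger the missing-data broadcast described below):

```python
def missing_children(db, parent_key):
    """For a Merkle-interior record whose value is
    child1_hash || child2_hash, return the child keys that are
    not locally retrievable."""
    value = db[parent_key]
    children = [value[i:i + 32] for i in range(0, len(value), 32)]
    return [c for c in children if c not in db]
```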

If child1 or child2 is missing, it would send a missing-data broadcast.  This includes proof that the key-hash is a valid key.  The receiving nodes would then try to get the data themselves, using the same "missing-data" notification message.

If the receiver cannot send back the data, then the local node's distance to that client-id via that remote node should be set to 255 and the routing table updated.  Honest nodes should immediately respond with "unknown" to speed things up.  This would update all the routing tables back from the actual target node reached.

Belkaar
Full Member

Activity: 177
Merit: 100


February 18, 2013, 03:56:13 PM
 #2

One such Hash->Data P2P storage system is already implemented in Freenet
https://freenetproject.org/

