picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 29, 2014, 09:46:35 AM |
|
Hi,
I'm a software engineer and I want to contribute to bitcoin[d] development. I'm a data compression expert, so I thought a block compression library would be a good place to start.
I've read about the "ultimate" blockchain compression scheme, where lite nodes keep only the relevant leaves of the merkle tree. However, as I understand it, we are still far from having that implemented, and it would not help full nodes, nodes with lots of transactions, or anyone else who needs to keep the full blockchain. So my view is that blockchain compression, while not critical now, is a nice-to-have feature.
The library I'm thinking of would be usable by wallets and daemons, both for storage and for transmission. The algorithms would require modest computational resources and be friendly to hardware implementation. My quick investigation suggests that considerable compression ratios (around 50%) may be achievable.
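As a rough baseline for that figure, here is a minimal sketch (Python; the blk00000.dat path is an assumption, pointing at any blk*.dat from a Bitcoin Core data directory) of what a general-purpose compressor already achieves on raw block data:
Code:
import zlib

# Assumed path: any blk*.dat file from a Bitcoin Core data directory.
with open("blk00000.dat", "rb") as f:
    raw = f.read(16 * 2**20)  # first 16 MiB is enough for an estimate

compressed = zlib.compress(raw, 9)
print("raw:        %d bytes" % len(raw))
print("compressed: %d bytes" % len(compressed))
print("ratio:      %.1f%%" % (100.0 * len(compressed) / len(raw)))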
Is this something worth doing?
Thanks!
|
|
|
|
capsqrl
|
|
May 29, 2014, 10:35:29 AM |
|
Quote from: picopicopico
The algorithms would require modest computational resources and be friendly to hardware implementation. My quick investigation suggests that considerable compression ratios (around 50%) may be achievable.

Could you go into more detail? Blockchain compression is not exactly a new field of study around here, yet it seems you have identified considerable low-hanging fruit. I don't know much about compression myself, but given that hashes, signatures, etc. closely resemble random noise, it's hard to imagine how a 50% compression ratio could be achieved.
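The intuition is easy to check: a general-purpose compressor gains almost nothing on random-looking hashes but collapses repetitive structure. A toy demonstration (the data here is synthetic, not real chain data):
Code:
import os, zlib

random_hashes = os.urandom(32 * 10000)    # stand-in for 10,000 tx hashes
script_prefix = b"\x76\xa9\x14" * 100000  # a repeated standard-script pattern

for name, data in [("random 32-byte hashes", random_hashes),
                   ("repeated script bytes", script_prefix)]:
    ratio = 100.0 * len(zlib.compress(data, 9)) / len(data)
    print("%s: %.1f%% of original" % (name, ratio))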
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 29, 2014, 11:47:08 AM |
|
I haven't done a detailed study yet, but speaking of low-hanging fruit, what about an address dictionary, typical script patterns, and database index compression?
Once there's confirmed interest, I plan to start a detailed study and open a technical discussion. I'd also appreciate links to previous technical discussions (my googling skills are apparently not sufficient to find anything technical besides that "ultimate" scheme, which I consider orthogonal to ordinary compression).
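To make the address-dictionary idea concrete: repeated 20-byte hash160 values could be replaced by small back-references into a table built on the fly. A toy sketch (all names invented; no byte-level framing or varint encoding shown):
Code:
def dictionary_encode(hashes):
    table, out = {}, []
    for h in hashes:
        if h in table:
            out.append(("ref", table[h]))  # small index instead of 20 bytes
        else:
            table[h] = len(table)
            out.append(("lit", h))         # first occurrence stays literal
    return out

def dictionary_decode(tokens):
    table, out = [], []
    for kind, value in tokens:
        if kind == "lit":
            table.append(value)
            out.append(value)
        else:
            out.append(table[value])
    return out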
|
|
|
|
apxu
Member
Offline
Activity: 229
Merit: 13
|
|
May 29, 2014, 07:12:01 PM |
|
Quote from: picopicopico
I haven't done a detailed study yet, but speaking of low-hanging fruit, what about an address dictionary, typical script patterns, and database index compression?

Enable compression on the hard drive and place the bitcoin files on it. And think about which is cheaper: disk space or processor work?
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 29, 2014, 09:36:33 PM |
|
I don't think common disk compression methods are efficient for the blockchain. Efficient compression requires understanding the underlying structure.
As for cheap vs. expensive: I think users' time is more expensive than processor time or disk space. We could significantly cut the time spent waiting for a wallet to sync, or for a new wallet to initialize.
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 29, 2014, 10:07:43 PM |
|
...also, this would be an optional feature anyway
|
|
|
|
Kluge
Donator
Legendary
Offline
Activity: 1218
Merit: 1015
|
|
May 29, 2014, 10:10:18 PM Last edit: May 30, 2014, 02:34:02 PM by Kluge |
|
Quote from: picopicopico
I haven't done a detailed study yet, but speaking of low-hanging fruit, what about an address dictionary, typical script patterns, and database index compression?

Quote from: apxu
Enable compression on the hard drive and place the bitcoin files on it. And think about which is cheaper: disk space or processor work?

Disk space is fairly insignificant. Transmission is the real money/time suck. I bugged Jeff about compressing the bootstrap torrent he maintains. He's open to it... needs more prodding, I think. Even without anything built specifically for Bitcoin, >40% can be cut from the blockchain without requiring anywhere near as much time to decompress (compared to the download time saved), unless you have a higher-end cable connection or fiber.
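That >40% figure is easy to spot-check with stock compressors. A sketch (Python; the bootstrap.dat path is assumed, and only a slice is read to keep the test quick, so ratios on the full file will differ somewhat):
Code:
import lzma, time, zlib

# Assumed path: the bootstrap.dat distributed via the torrent.
with open("bootstrap.dat", "rb") as f:
    raw = f.read(64 * 2**20)  # 64 MiB slice

for name, compress in [("zlib -9", lambda d: zlib.compress(d, 9)),
                       ("xz preset 6", lambda d: lzma.compress(d, preset=6))]:
    t0 = time.time()
    out = compress(raw)
    print("%s: %.1f%% of original in %.1f s"
          % (name, 100.0 * len(out) / len(raw), time.time() - t0))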
|
|
|
|
grau
|
|
May 30, 2014, 04:06:15 AM |
|
Quote from: picopicopico
I don't think common disk compression methods are efficient for the blockchain. Efficient compression requires understanding the underlying structure.
As for cheap vs. expensive: I think users' time is more expensive than processor time or disk space. We could significantly cut the time spent waiting for a wallet to sync, or for a new wallet to initialize.

Block chain data does not compress impressively on a global scale, but indices on addresses and tx hashes do. Bits of Proof stores both the block chain data and the supplementing indices in LevelDB and achieves high performance retrieving transactions that refer to an arbitrary HD master key, which is why it powers the myTREZOR web wallet. I am sure it could be further optimized with your ideas, so let me know if you'd like to discuss them in that scope.
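The kind of index layout that makes such lookups fast is worth spelling out. A hypothetical key scheme (not Bits of Proof's actual format) that keeps all entries for one address adjacent, so a LevelDB range scan returns them in chain order:
Code:
def addr_index_key(hash160: bytes, height: int, txpos: int) -> bytes:
    # 20-byte address hash first, then big-endian block height and
    # in-block position, so iterating the key range for one address
    # yields its transactions in chain order.
    assert len(hash160) == 20
    return hash160 + height.to_bytes(4, "big") + txpos.to_bytes(4, "big")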
|
|
|
|
theymos
Administrator
Legendary
Offline
Activity: 5376
Merit: 13410
|
|
May 30, 2014, 05:56:43 AM |
|
blocks/blk*.dat could probably benefit from some sort of compression. These files are reasonably compressible and very large, but they're not accessed very frequently. It might also be good to compress blocks before sending them to peers, especially when they're downloading many blocks. The other files are already compressed about as well as they can be without hurting performance, or small enough that it doesn't matter.
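For the transmission side, the simplest shape would be a compressed envelope around the existing block serialization. A hypothetical framing, not part of the actual P2P protocol:
Code:
import struct, zlib

def pack_cblock(raw_block: bytes) -> bytes:
    # original length up front so the receiver can sanity-check
    return struct.pack("<I", len(raw_block)) + zlib.compress(raw_block, 6)

def unpack_cblock(msg: bytes) -> bytes:
    (orig_len,) = struct.unpack("<I", msg[:4])
    block = zlib.decompress(msg[4:])
    assert len(block) == orig_len
    return block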
|
1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
|
|
|
apxu
Member
Offline
Activity: 229
Merit: 13
|
|
May 30, 2014, 06:58:52 AM |
|
Quote from: picopicopico
I don't think common disk compression methods are efficient for the blockchain. Efficient compression requires understanding the underlying structure.

I disagree. You will get benefits from compressing big chunks of data (such as the blk files), not small pieces (individual transactions in blocks).

Quote from: picopicopico
As for cheap vs. expensive: I think users' time is more expensive than processor time or disk space. We could significantly cut the time spent waiting for a wallet to sync, or for a new wallet to initialize.

The main time-expensive routine is verifying ECDSA signatures, not downloading. But we cannot (?) eliminate this step on every node. Hmm... maybe some checkpoints? Let's say we have bootstrap.dat and all the index files up to the block of May 1, 2014, and the client has a hardcoded hash of this data. A new user then has to download the bootstrap and indexes, check the hash, and... not verify all the signatures from the beginning of the bitcoin era.
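The verification step for such a scheme would be a single hash over the downloaded data, compared against a constant shipped with the client. A sketch (the hash value here is a placeholder, not a real release constant):
Code:
import hashlib

# Placeholder: a real client would ship this constant with its release.
HARDCODED_BOOTSTRAP_HASH = "0" * 64

def verify_bootstrap(path: str) -> bool:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == HARDCODED_BOOTSTRAP_HASH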
|
|
|
|
theymos
Administrator
Legendary
Offline
Activity: 5376
Merit: 13410
|
|
May 30, 2014, 07:18:32 AM |
|
Quote from: apxu
Hmm... maybe some checkpoints? Let's say we have bootstrap.dat and all the index files up to the block of May 1, 2014, and the client has a hardcoded hash of this data. A new user then has to download the bootstrap and indexes, check the hash, and... not verify all the signatures from the beginning of the bitcoin era.
Bitcoin Core already skips signature verification on blocks before the latest checkpoint.
|
1NXYoJ5xU91Jp83XfVMHwwTUyZFK64BoAD
|
|
|
apxu
Member
Offline
Activity: 229
Merit: 13
|
|
May 30, 2014, 08:11:21 AM |
|
Quote from: theymos
Bitcoin Core already skips signature verification on blocks before the latest checkpoint.

But it still parses all blocks and transactions to create the indexes. Or am I wrong?
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 30, 2014, 08:28:08 AM |
|
Quote from: apxu
I disagree. You will get benefits from compressing big chunks of data (such as the blk files), not small pieces (individual transactions in blocks).

I didn't say there's no benefit in compressing big chunks. However, in this thread I'd like to gauge interest and requirements rather than the benefits of a specific compression technique. I think it's pretty clear that disk compression is not the solution.
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 30, 2014, 08:40:38 AM |
|
Quote from: grau
Block chain data does not compress impressively on a global scale, but indices on addresses and tx hashes do. Bits of Proof stores both the block chain data and the supplementing indices in LevelDB and achieves high performance retrieving transactions that refer to an arbitrary HD master key, which is why it powers the myTREZOR web wallet. I am sure it could be further optimized with your ideas, so let me know if you'd like to discuss them in that scope.

I'm definitely interested in this. I think optimization of network transmission and of storage on portable/battery-powered devices are the major targets to consider. I agree with the comments that disk space on a PC is not an issue.
|
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
May 30, 2014, 08:52:34 AM |
|
Quote from: apxu
The main time-expensive routine is verifying ECDSA signatures, not downloading. But we cannot (?) eliminate this step on every node.
Hmm... maybe some checkpoints? Let's say we have bootstrap.dat and all the index files up to the block of May 1, 2014, and the client has a hardcoded hash of this data. A new user then has to download the bootstrap and indexes, check the hash, and... not verify all the signatures from the beginning of the bitcoin era.

ECDSA checking may be expensive, but I open my wallet every few days, and on every open I see considerable network traffic for a couple of minutes (I'm on ADSL). I wouldn't mind halving that time. For bitcoin nodes with many downstream links this might be an even bigger problem.
|
|
|
|
maaku
Legendary
Offline
Activity: 905
Merit: 1012
|
|
May 30, 2014, 03:27:33 PM |
|
The issues there are not block chain data size, but rather the network sync algorithm. The bitcoind code that is deployed right now uses a rather stupid algorithm for fetching blocks from peers that results in downloading the same block multiple times. There is a developer working on that right now. Block chain size is currently rate limited to about 1MB per 10 minutes, or about 13kbps. It is not at all clear that compressing this data will result in faster syncs, but if that is what interests you then by all means give it a shot.
Note that there is already a Script compressor which is used by the ultraprune code to greatly reduce the size of common bitcoin scriptPubKeys.
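For readers unfamiliar with that compressor, the core idea is template substitution: standard scripts are reduced to a type tag plus their variable part. A rough Python illustration (tag values invented here; see the actual compressor in the Bitcoin Core source for the real encoding):
Code:
def compress_script(script: bytes) -> bytes:
    # pay-to-pubkey-hash: OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG
    if (len(script) == 25 and script[:3] == b"\x76\xa9\x14"
            and script[23:] == b"\x88\xac"):
        return b"\x00" + script[3:23]   # 25 bytes -> 21 bytes
    # pay-to-script-hash: OP_HASH160 <20 bytes> OP_EQUAL
    if len(script) == 23 and script[:2] == b"\xa9\x14" and script[22] == 0x87:
        return b"\x01" + script[2:22]
    return b"\xff" + script             # anything else stays verbatim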
|
I'm an independent developer working on bitcoin-core, making my living off community donations. If you like my work, please consider donating yourself: 13snZ4ZyCzaL7358SmgvHGC9AxskqumNxP
|
|
|
picopicopico (OP)
Newbie
Offline
Activity: 14
Merit: 0
|
|
June 02, 2014, 11:17:09 AM |
|
Quote from: maaku
Block chain size is currently rate limited to about 1MB per 10 minutes, or about 13kbps. It is not at all clear that compressing this data will result in faster syncs, but if that is what interests you then by all means give it a shot.

1. If I sync once a day, I need about 3 minutes just to fetch the data (250 KiB/s channel, about 200 KiB per block, 240 blocks).
2. As I understand it, clients propagate new transactions through the network instantly, and whenever there's a new block there's a spike in the amount of data (which may well exceed 13 kbps).
3. The new block contains transactions previously propagated through the network, so some (all?) transactions are received by peers twice: once when the transaction is created, and again as part of a block.
Is this correct?
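For what it's worth, the arithmetic in point 1 checks out apart from the block count (see the correction below):
Code:
blocks = 240         # the assumption above; ~144 is closer to one day's blocks
block_kib = 200
channel_kib_s = 250

total_kib = blocks * block_kib
print("%d KiB, %.0f s" % (total_kib, total_kib / channel_kib_s))
# 240 blocks -> 48000 KiB, ~192 s (about 3 minutes)
# 144 blocks -> 28800 KiB, ~115 s (about 2 minutes)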
|
|
|
|
apxu
Member
Offline
Activity: 229
Merit: 13
|
|
June 02, 2014, 12:52:20 PM |
|
Quote from: picopicopico
1. If I sync once a day, I need about 3 minutes just to fetch the data (250 KiB/s channel, about 200 KiB per block, 240 blocks).

~144 blocks, actually.

Quote from: picopicopico
3. The new block contains transactions previously propagated through the network, so some (all?) transactions are received by peers twice: once when the transaction is created, and again as part of a block. Is this correct?

This is usually correct.
|
|
|
|
maaku
Legendary
Offline
Activity: 905
Merit: 1012
|
|
June 02, 2014, 11:53:45 PM |
|
But the protocol layer can fix that. A block that is just header + coinbase + txid list would be pretty short.
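A sketch of what such a message might look like (the layout is invented for illustration; at 32 bytes per txid instead of a few hundred bytes per full transaction, the relayed block shrinks by roughly an order of magnitude):
Code:
import struct

def pack_thin_block(header: bytes, coinbase: bytes, txids: list) -> bytes:
    # 80-byte header, full coinbase tx, then only the 32-byte txids
    assert len(header) == 80 and all(len(t) == 32 for t in txids)
    return (header
            + struct.pack("<I", len(coinbase)) + coinbase
            + struct.pack("<I", len(txids)) + b"".join(txids))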
|
I'm an independent developer working on bitcoin-core, making my living off community donations. If you like my work, please consider donating yourself: 13snZ4ZyCzaL7358SmgvHGC9AxskqumNxP
|
|
|
apxu
Member
Offline
Activity: 229
Merit: 13
|
|
June 03, 2014, 03:36:47 AM |
|
Quote from: maaku
But the protocol layer can fix that. A block that is just header + coinbase + txid list would be pretty short.

Yes, but what if I don't have one or more of the transactions in my mempool to assemble the block from this template? This situation occurs when a transaction comes to a miner and the miner accepts it into a block and solves the block immediately afterwards. Or the miner takes a very old (month- or year-old) transaction from its mempool. So no node on the network has this transaction in its mempool. Then the first node that receives the "template" has to ask its peer for the missing transaction. From https://en.bitcoin.it/wiki/Protocol_specification#getdata:

Quote
getdata is used in response to inv, to retrieve the content of a specific object, and is usually sent after receiving an inv packet, after filtering known elements. It can be used to retrieve transactions, but only if they are in the memory pool or relay set - arbitrary access to transactions in the chain is not allowed to avoid having clients start to depend on nodes having full transaction indexes (which modern nodes do not).

But the peer does not have this tx in its memory pool either, because the tx is already in a block.
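The reconstruction step being described would look roughly like this (a sketch; all names are illustrative):
Code:
def reconstruct(txids, mempool):
    txs, missing = [], []
    for txid in txids:
        tx = mempool.get(txid)    # None if we never saw it relayed
        if tx is None:
            missing.append(txid)  # must be requested from the announcing peer
        txs.append(tx)
    return txs, missing           # holes are filled once the requests return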
|
|
|
|
|