Bitcoin Forum
Author Topic: Using compact indexes instead of hashes as identifiers.  (Read 2831 times)
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
February 26, 2016, 03:50:34 AM
Merited by ABCbits (2)
 #1

An implementation idea would be to assume that SIGHASH_ALL is used for all these; it seems that other SIGHASH modes are rarely used, and I'm not sure it makes sense to support them for radically new usecases.
Well everything I described there is completely compatible with all scripthash types; so no real need to limit flexibility. Assuming the sighash code is separate and reused, it shouldn't increase implementation complexity.

Quote
Also, is there a reason that a unique number can't be used to identify each txid and even each output script (address)? To my thinking, after a block is past any chance of being reorganized, there is a canonical ordering of all blocks, and therefore of all tx and vins and vouts.

Since each spend currently requires the 32-byte txid and vout, mapping this to 4 or 6 bytes creates a lot of space savings. There is a lot of redundancy in the blockchain, with each txid potentially being duplicated once for each vout.
That could be done-- but there is no need to do so normatively.  E.g. Peers could agree to transmit data using these compressed indexes, while still hashing the original values. This has the advantage of having the IDs not change out from under already authored transactions, and making sure that offline devices don't need access to (or to trust) external index providers.

Quote
The other big redundancy is reused addresses, which are actually rmd160 hashes inside spend scripts. Using the canonical ordering of everything, each rmd160 hash would map to a 32-bit index, and with the vast majority of scripts being standard, a few more bytes can be encoded into a few bits. However, with the coming diaspora of p2sh script proliferation, it could be that address numbering won't be so effective in the future.
This would undermine the privacy/fungibility properties of bitcoin as a whole by incentivizing parties to reuse addresses. Bitcoin's privacy strongly depends on non-reuse, and additional pressure against it for the benefit of saving on the order of 12 bytes per txout doesn't sound like a win-- especially as it would be used to justify bad software, bad business practices, and bad public policy that force users to reuse. Access to those indexes would have to be handled quite carefully, since if you paid the wrong one the funds would be stolen.

Thanks for your thoughts!
I used to think the BTC blockchain's exponential growth was a big problem. Then I realized I could distill it to one quarter the size and also sync in parallel at bandwidth speeds. Now I don't worry about blockchain bloat so much.

I think we have to admit that a large part of the BTC blockchain has been deanonymized. And until new methods like CT are adopted, this will only get worse.

So rather than fight a losing battle, why not accept that there are people who don't care much about privacy and for whom convenience is more important. In fact they might be required to be public. The canonical encoding encodes the existing blockchain (and future public blockchains) much better than any other method, as it ends up as compact 32-bit numbers instead of a high-entropy 32-byte txid + vout. The savings is much more than 12 bytes; it takes only 6 bytes to encode an unspent, so closer to 30 bytes.

A lot of bitcoin processing is slow for a variety of reasons. Being able to do computations with standard integers allows for an order of magnitude faster operations in many cases. I think the usage of a DB is the cause of most of the slowdown. Since the blockchain mostly never changes, there is no need to use ACID-compliant DB operations for everything. With my design all nodes are doing txindex=1 and it takes very little time to "rescan" a new privkey, as all the tx are already in linked lists and hash tables.
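To make the size claim concrete, here is a minimal C sketch of what the compact form could look like next to the fully expanded form; the struct and field names are just illustrative, not the actual implementation:

Code:
#include <stdint.h>

// Hypothetical compact reference to an unspent output: 6 bytes instead of the
// 36 bytes (32-byte txid + 4-byte vout) that appear in a signed transaction.
#pragma pack(push, 1)
struct compact_unspent
{
    uint32_t txidind;   // canonical 32-bit index of the txid, assigned in blockchain order
    uint16_t vout;      // output index within that tx
};
#pragma pack(pop)

// The fully expanded form is still what gets signed and sent to peers;
// the compact form is only a locally derived pointer to it.
struct expanded_outpoint
{
    uint8_t  txid[32];  // sha256d transaction hash
    uint32_t vout;
};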

The assumption is that each node is maintaining the canonical indexes; otherwise trust is required. With that assumption, this indexing system can be viewed as a highly efficient lossless codec, so I'm not sure why you consider it bad software. I assume you don't consider using gzip a bad practice? I try hard to preserve compatibility with the existing Bitcoin Core. What I describe does not change the network protocol at all; that is a separate post Smiley

I am not advocating changing the signed tx to be based on 32-bit numbers! All I recommend is to encode them internally (when beyond the practical chance of being reorged) as 32 bits. All nodes will end up with the same numbering. However, all external interactions continue with the fully expanded form, else it wouldn't be compatible at all. All bundles and subsets of bundles would have verifiable hashes, and this creates an efficient encoding for permanent archival storage, with arguably isomorphic behavior to the current raw blockchain inside a DB. The tx that are signed and verified are of course using the fully expanded form. However, if you just want to explore the blockchain, nobody really cares about the exact txid, just that funds went from A to B.

These are not just thoughts, but a description of a working system that syncs the entire blockchain and creates the above data structures in about half an hour if you have enough bandwidth and 8 cores. It is bandwidth limited, so for a slow home connection of 20 Mbps, it takes 6 hours for a full sync. I am currently debugging the unspents handling, but the rest is basically working and just needs validation. And this part is derived from MGW, which has been in service for a couple of years.

I wanted to use the standard bitcoin RPC test suite, and I assume what is in the repo is the most complete set? Is there some brute-force RPC tester that exercises large parts of the RPC and can be used as a validation step for a new implementation of Bitcoin Core?

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4158
Merit: 8382



View Profile WWW
February 26, 2016, 05:09:46 AM
Merited by ABCbits (3)
 #2

[I split the thread because we were getting away from talking about the compact aggregated signatures and into things which don't change the network behavior.]

In fact they might be required to be public.
Something we must fight against if we want Bitcoin to retain its utility in the future. No one uses a system of money with fungibility and privacy as bad as a bitcoin that has lost its properties.

Quote
in many cases. I think the usage of a DB is the cause of most of the slowdown. Since the blockchain mostly never changes, there is no need to use ACID-compliant DB operations for everything.
We do not store the blockchain in a database in Bitcoin Core.

Quote
With my design all nodes are doing txindex=1 and it takes very little time to "rescan" a new privkey, as all the tx are already in linked lists and hash tables.
txindex=1 is not very compatible with scalability; you end up with an ever-growing index. It makes some operations faster, but at a considerable cost.

In Bitcoin Core, reindex spends a fair amount of time rehashing things to check data integrity, because later it is assumed to be true. It could easily be made very fast-- but at the cost of additional indexes. Right now the marginal cost of a full node wallet that supports arbitrary key import is about 65GB and rapidly growing. (Pruned vs non-pruned, plus txindex.)

Quote
so I'm not sure why you consider it bad software. I assume you don't consider using gzip a bad practice?
The "bad software" wasn't referring to local optimizations by any means.  What I mean is that someone who doesn't like privacy can release a wallet which forces its users to always reuse addresses and then justify it with "reuse is more efficient because things use the indexes"-- but even this would only apply if they were used 'on the wire'; if they're purely local, they're purely local.   The txindex in Bitcoin Core works internally with compact indexes in a similar way, in fact.

I'm super supportive of purely local optimizations.

Quote
(when beyond the practical chance of being reorged) as 32 bits.
If it's purely local, you could probably always number that way, so long as you had code to fix up the numbering during reorg.

Quote
I wanted to use the standard bitcoin RPC test suite, and I assume what is in the repo is the most complete set? Is there some brute-force RPC tester that exercises large parts of the RPC and can be used as a validation step for a new implementation of Bitcoin Core?
There is an additional test harness built from BitcoinJ that simulates a bitcoin network and puts the node through its paces. It's called blocktester. Look at the Travis CI configuration in Bitcoin Core to get an idea of the tests that are automatically run on it.

Standard warnings about the near impossibility of writing new code which is consensus compatible with other code apply-- if you don't find some new surprising behavior in the network that we don't have tests for, you're almost certainly not testing hard enough to have a chance at it. Smiley
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
February 26, 2016, 06:18:51 AM
 #3

Something we must fight against if we want Bitcoin to retain its utility in the future. No one uses a system of money with fungibility and privacy as bad as a bitcoin that has lost its properties.
Of course, and I don't believe anything I have done would cause things to get any worse. I am just trying to create a scalable framework based on bitcoin that can also be privacy-enhanced.

Quote
We do not store the blockchain in a database in Bitcoin Core.
OK, so storing the raw network data in a raw file isn't a database; my point was that to do almost all RPC, the DB is involved, and I think it is fair to say that for most people the RPC path is what is used.

Quote
txindex=1 is not very compatible with scalability; you end up with an ever-growing index. It makes some operations faster, but at a considerable cost.

In Bitcoin Core, reindex spends a fair amount of time rehashing things to check data integrity, because later it is assumed to be true. It could easily be made very fast-- but at the cost of additional indexes. Right now the marginal cost of a full node wallet that supports arbitrary key import is about 65GB and rapidly growing. (Pruned vs non-pruned, plus txindex.)

Certainly txindex=1 adds overhead, but with the method of creating unchanging bundle files, each of which has hash tables to find everything internally and where all unspents to the same destination are in a linked list, the overhead is contained within each bundle. I don't see why anything ever has to be verified again once it has been verified and put into a read-only file. For the paranoid, a hash of each bundle file could be made and verified before using it.

The tracking of the status of unspents is the only global data that needs to be updated, and it is written once for each entry.

Over time, an operation might need to scan all bundles, but I just got benchmarks on a 1.4 GHz i5: 2.5 milliseconds to scan 1900 bundles (for a coin using 500-block bundles). So it is not blazingly fast, but it would be fast enough for most use cases and comparable to RPC overhead, where things are serialized with system locks. And this comes with the ability to find all utxo for any address. Also note that the parallel design makes multithreading ideal, and speedups using multiple cores would be trivial to add.
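As a rough sketch of the kind of lookup involved, assuming each read-only bundle carries a small hash table from rmd160 to the head of a linked list of its unspents (all names and layouts here are invented for illustration, not the actual structures):

Code:
#include <stdint.h>
#include <string.h>

#define NUM_BUNDLES    1900     // e.g. ~1900 bundles of 500 blocks each
#define HASHTABLE_SIZE 65536

// Invented layouts, for illustration only.
struct unspent    { uint32_t txidind; uint16_t vout; uint64_t value; uint32_t next; }; // next == 0 terminates the list
struct addr_entry { uint8_t rmd160[20]; uint32_t firstunspent; };                      // head of that address's linked list

struct bundle
{
    struct addr_entry *table[HASHTABLE_SIZE];  // per-bundle hash table keyed by rmd160
    struct unspent    *unspents;               // append-only array; slot 0 unused
};

// Walk every bundle's hash table and linked list to total the funds ever sent to one address.
uint64_t balance_of(struct bundle *bundles, const uint8_t rmd160[20])
{
    uint64_t total = 0;
    for (int i = 0; i < NUM_BUNDLES; i++)
    {
        struct bundle *bp = &bundles[i];
        uint32_t bucket = ((uint32_t)rmd160[0] << 8) | rmd160[1];   // toy hash: first two bytes
        struct addr_entry *ap = bp->table[bucket];
        if (ap == NULL || memcmp(ap->rmd160, rmd160, 20) != 0)
            continue;                                               // real code would probe on collision
        for (uint32_t u = ap->firstunspent; u != 0; u = bp->unspents[u].next)
            total += bp->unspents[u].value;
    }
    return total;
}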

If even more performance is needed, each speed boosted address can just keep things cached.

Without sigs, my data set for txindex=1 is about 25GB uncompressed and 15GB compressed. I am thinking of defaulting to not storing the sigs, as once a spend is validated and the bundle is marked as having sigs and inputs verified, I don't see many cases where the sigs are needed anymore. Certainly not at the cost of doubling the total space.

Quote
If it's purely local, you could probably always number that way, so long as you had code to fix up the numbering during reorg.
What number of blocks is beyond the common-sense reorg danger? I know theoretically very long reorgs are possible, but from what I have seen big reorgs are not so common. I think that by setting a delay of N blocks before creating the most recent bundle, the odds of having to regenerate that bundle would be very low. Since it takes about a minute to regenerate, it isn't a giant disaster if it has to, but it is best to set the default so it almost never has to.

Then by keeping the full-sized form of the data for the most recent N blocks, I think the advantage of compactness is achieved without any negatives [with the caveat that any reorg past the N blocks requires regenerating the bundle and everything past it], and these are all purely local optimizations.
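A tiny sketch of that rule, with placeholder values for the bundle size and the lag N:

Code:
#include <stdint.h>

#define BUNDLE_SIZE 500   // blocks per bundle (placeholder)
#define REORG_LAG   128   // N: keep the full-sized form for the newest N blocks

// A bundle covering heights [i*BUNDLE_SIZE, (i+1)*BUNDLE_SIZE) is only turned into its
// compact read-only form once the chain tip is at least N blocks past its last block.
static int bundle_is_finalizable(uint32_t bundle_index, uint32_t tip_height)
{
    uint32_t bundle_end = (bundle_index + 1) * BUNDLE_SIZE;   // first height after the bundle
    return tip_height >= bundle_end + REORG_LAG;
}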

I was wondering if it is possible to get a network services bit? It would indicate that a node is available for tx-based queries from other nodes. Since txindex=1 comes for free, I figure it would be a good thing to make available to other nodes, in addition to sharing hashes of the bundles. This would make things not purely local, but it would operate orthogonally to existing peers and just help bootstrap new nodes and lightweight nodes, and it would only affect other nodes that have this bit set. I saw that the docs say to ask about such bits or just start using one. I am not sure of the exact process to get a bit allocated.

Quote
There is an additional test harness built from BitcoinJ that simulates a bitcoin network and puts the node through its paces. It's called blocktester. Look at the Travis CI configuration in Bitcoin Core to get an idea of the tests that are automatically run on it.
Fantastic! This will help a lot. Thank you.

Quote
Standard warnings about the near impossibility of writing new code which is consensus compatible with other code apply-- if you don't find some new surprising behavior in the network that we don't have tests for, you're almost certainly not testing hard enough to have a chance at it. Smiley
I know. Unfortunately, I am used to doing the impossible stuff, and external events forced me onto this path in recent months. I started the iguana iteration last November.

The first production iteration won't be a mining-enabled version; my target audience (the mass market) doesn't really have many ASICs handy. So by following the network consensus, it avoids having to recreate all the intricate consensus rules for all the variations in the installed base.

I can recommend that everyone write their own Bitcoin Core equivalent. It sure is a good way to learn how it all works Smiley

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
February 26, 2016, 06:32:17 AM
 #4

A side note.

By combining each read-only bundle with balances for all the addresses, it will be possible to validate any consecutive number of bundles by iterating through all the tx in each bundle and updating balances.

Regardless of where the iterations are started, all paths arriving at the same place will have the identical ledger.

Using the method described above, every ledger balance can be verified. It is kind of like auditing quarterly/annual financials: once the books are closed, only historians need to worry about each individual tx.
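A minimal sketch of that bundle-by-bundle replay, with invented structures standing in for the real ones:

Code:
#include <stdint.h>

// Invented structures: a flat ledger of balances and per-bundle deltas.
struct ledger_entry { uint8_t rmd160[20]; int64_t balance; };
struct ledger       { struct ledger_entry *entries; int num; };

struct bundle_delta { uint32_t addrind; int64_t amount; };   // amount < 0 for a spend
struct bundle       { struct bundle_delta *deltas; int numdeltas; };

// Replay bundles [start, end) on top of a starting ledger. Any node that starts from the
// same ledger and applies the same bundles in order arrives at an identical result, so the
// end state can be hashed and compared against other nodes.
void replay_bundles(struct ledger *L, const struct bundle *bundles, int start, int end)
{
    for (int i = start; i < end; i++)
        for (int j = 0; j < bundles[i].numdeltas; j++)
            L->entries[bundles[i].deltas[j].addrind].balance += bundles[i].deltas[j].amount;
}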

With the data so much more compact, the urgency is not as great, but at some point it will be nice to simply accept a set of starting balances for a given bundle along with unspents and sync from there.

I am hoping you know a way for the balances of a recent bundle to be trustable without having to replay all the bundles. Maybe aggregated signatures or Pedersen commitments or <your favorite crypto magic> can prove that the balances of the latest bundle indeed match the sum of the changes from all the previous bundles? If that is possible, then the "only" thing that would be needed is a way to know that the contents of a bundle can be trusted.

Maybe just a set of hashes that are propagated would be enough for that, but I am sure there are more secure ways. A large amount of processing would be justifiable to create such a verifiable ledgerchain.

Thanks for all the advice!

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
marcus_of_augustus
Legendary
*
Offline Offline

Activity: 3920
Merit: 2348


Eadem mutata resurgo


View Profile
March 03, 2016, 11:05:04 AM
 #5

Quote
What number of blocks is beyond the common-sense reorg danger? I know theoretically very long reorgs are possible, but from what I have seen big reorgs are not so common. I think that by setting a delay of N blocks before creating the most recent bundle, the odds of having to regenerate that bundle would be very low. Since it takes about a minute to regenerate, it isn't a giant disaster if it has to, but it is best to set the default so it almost never has to.

For reference, there is a protocol rule which prohibits generated BTC (the coinbase tx) from being spent until it has 100 confirmations. The older Satoshi client had an additional safety margin: it wouldn't let you spend generated BTC until it had 120 confirmations. Discussion surrounding locking coins for sidechains was in the 144-288 block depth range.

jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 03, 2016, 11:14:05 AM
 #6

Quote
What number of blocks is beyond the common-sense reorg danger? I know theoretically very long reorgs are possible, but from what I have seen big reorgs are not so common. I think that by setting a delay of N blocks before creating the most recent bundle, the odds of having to regenerate that bundle would be very low. Since it takes about a minute to regenerate, it isn't a giant disaster if it has to, but it is best to set the default so it almost never has to.

For reference, there is a protocol rule which prohibits generated BTC (the coinbase tx) from being spent until it has 100 confirmations. The older Satoshi client had an additional safety margin: it wouldn't let you spend generated BTC until it had 120 confirmations. Discussion surrounding locking coins for sidechains was in the 144-288 block depth range.
Thanks. Maybe 128? That is close to a full day and a power of 2.

I don't know the formula to calculate the odds of a reorg that big, but short of some sort of attack or a buggy mainchain release, the chance would seem to be quite small if 10 blocks already gets you to six sigma.
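The closest standard reference I know of is the attacker catch-up calculation in section 11 of the bitcoin whitepaper, which models a deliberate attacker with hashpower share q rewriting z blocks rather than random variance; a small C sketch of that calculation:

Code:
#include <math.h>
#include <stdio.h>

// Probability that an attacker with hashpower share q (q < 0.5) ever catches up
// from z blocks behind, per section 11 of the bitcoin whitepaper.
static double attacker_success(double q, int z)
{
    double p = 1.0 - q;
    double lambda = z * (q / p);
    double poisson = exp(-lambda);              // Poisson term for k = 0
    double sum = 1.0;
    for (int k = 0; k <= z; k++)
    {
        sum -= poisson * (1.0 - pow(q / p, z - k));
        poisson *= lambda / (k + 1);            // advance Poisson term to k + 1
    }
    return sum < 0 ? 0 : sum;                   // clamp tiny negative rounding error
}

int main(void)
{
    // e.g. a 10% attacker against depths of 10 and 128 blocks
    printf("q=0.10 z=10  -> %.3g\n", attacker_success(0.10, 10));
    printf("q=0.10 z=128 -> %.3g\n", attacker_success(0.10, 128));
    return 0;
}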

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
March 03, 2016, 03:22:38 PM
 #7

These are not just thoughts, but a description of a working system that syncs the entire blockchain and creates the above data structures in about half an hour if you have enough bandwidth and 8 cores. It is bandwidth limited, so for a slow home connection of 20 Mbps, it takes 6 hours for a full sync.

Bandwidth limited = I/O and CPU limited then?

We had a similar discussion in another thread, but anyway I want to write it more concisely here: I really believe scaling can be achieved through tradeoffs between what a user has and what he hasn't. Every machine will hit a bottleneck somewhere; the issue from that point onward is how to juggle the tradeoffs to bypass your bottleneck.

Examples:

- A user has a slow internet connection but plenty of CPU => He should have the option of downloading (and uploading, if he is running a node) very compressed blocks with other cooperating nodes that are willing to use compressed blocks.
- If the same user has a slow CPU as well => He should be able to tap into GPU resources for parallel compression/decompression (it's feasible - I've seen PDFs with research on the subject of parallel compression on GPUs).
- If a user has slow I/O but a fast CPU => Compress the blockchain for more local throughput.
- If a user has low RAM but a fast CPU => Compress the RAM (for things like a compressed cache).
- If a user has slow I/O or low RAM but will bottleneck on CPU if he goes to a compressed scheme => Tap into the GPU.
- If a user wants to increase cache size to increase performance but lacks RAM => Use compressed RAM and tap into more CPU to reduce I/O.

Essentially you try to trade what you have for what you want to achieve. One-size-fits-all solutions won't work particularly well because one user has a 1 Gbps connection, another has 4 Mbps. One is running on a Xeon, the other is running on a laptop. One has 16 GB of memory, another has 2. One can tap into a GPU for parallel processes, another can't because he doesn't have one.

IMO the best scaling solutions will be adjustable, so that bottlenecks can be identified on a particular system and the necessary tradeoffs made to compensate for its deficiencies. Compression is definitely one of the tools that can balance these needs, whether one lacks RAM, disk space on an SSD, network bandwidth, etc. And GPUs can definitely help with the processing task, whether the load is directly related to btc operations or to compression/decompression.
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 03, 2016, 04:07:13 PM
 #8

These are not just thoughts, but a description of a working system that syncs the entire blockchain and creates the above data structures in about half an hour if you have enough bandwidth and 8 cores. It is bandwidth limited, so for a slow home connection of 20 Mbps, it takes 6 hours for a full sync.

Bandwidth limited = I/O and CPU limited then?

We had a similar discussion in another thread, but anyway I want to write it more concisely here: I really believe scaling can be achieved through tradeoffs between what a user has and what he hasn't. Every machine will hit a bottleneck somewhere; the issue from that point onward is how to juggle the tradeoffs to bypass your bottleneck.

Examples:

- A user has a slow internet connection but plenty of CPU => He should have the option of downloading (and uploading, if he is running a node) very compressed blocks with other cooperating nodes that are willing to use compressed blocks.
- If the same user has a slow CPU as well => He should be able to tap into GPU resources for parallel compression/decompression (it's feasible - I've seen PDFs with research on the subject of parallel compression on GPUs).
- If a user has slow I/O but a fast CPU => Compress the blockchain for more local throughput.
- If a user has low RAM but a fast CPU => Compress the RAM (for things like a compressed cache).
- If a user has slow I/O or low RAM but will bottleneck on CPU if he goes to a compressed scheme => Tap into the GPU.
- If a user wants to increase cache size to increase performance but lacks RAM => Use compressed RAM and tap into more CPU to reduce I/O.

Essentially you try to trade what you have for what you want to achieve. One-size-fits-all solutions won't work particularly well because one user has a 1 Gbps connection, another has 4 Mbps. One is running on a Xeon, the other is running on a laptop. One has 16 GB of memory, another has 2. One can tap into a GPU for parallel processes, another can't because he doesn't have one.

IMO the best scaling solutions will be adjustable, so that bottlenecks can be identified on a particular system and the necessary tradeoffs made to compensate for its deficiencies. Compression is definitely one of the tools that can balance these needs, whether one lacks RAM, disk space on an SSD, network bandwidth, etc. And GPUs can definitely help with the processing task, whether the load is directly related to btc operations or to compression/decompression.
The problem is that a lot of the data is high entropy and not compressible. I believe I have identified and extracted most of the redundancy. Compressing the dataset is a one-time operation, and it doesn't really take much time to decompress.

The CPU is a bottleneck only if your connection is faster than 500 megabits/sec, and probably half that, as that is without verifying the sigs yet; but there is plenty of idle time during the sync.

With the bandwidth varying, we have anywhere from 30 minutes to 6 hours of sync time. The slower connection provides plenty of time for the CPU.

I eliminated the HDD as a bottleneck by eliminating most of the seeks and using an append-only set of files written directly as memory-mappable data structures.
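A minimal sketch of the append-only, memory-mapped approach on a POSIX system (the bundle file layout itself is not shown and the function name is made up):

Code:
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a finished, read-only bundle file straight into memory so its internal
// structures (hash tables, linked lists) can be used in place, with no seeks
// and no deserialization step. Returns NULL on failure.
void *map_bundle(const char *fname, size_t *lenp)
{
    int fd = open(fname, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                                  // the mapping stays valid after close
    if (ptr == MAP_FAILED)
        return NULL;
    *lenp = (size_t)st.st_size;
    return ptr;
}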

Each bundle can then be verified independently, and a chain of the bundle summaries can be used for a quick start, with the validity checked over the next 6 hours. We just need to disallow tx until everything is fully verified, or limit tx to ones that are known to be valid, i.e. fresh addresses getting new funds sent to them.

I use adaptive methods so it goes as fast as the entire system is able to. After all, if you only have one core, then it will be the bottleneck when validating 100,000 blocks of sigs. The way to have different "sizes" is to have a full node, a validating node, a truncated node (dump sigs after they are verified), and lite nodes.

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
March 03, 2016, 04:59:07 PM
 #9

The problem is that a lot of the data is high entropy and not compressible. I believe I have identified and extracted most of the redundancy. Compressing the dataset is a one-time operation, and it doesn't really take much time to decompress.

What seems incompressible can generally be compressed more if one shifts the order of characters around a file. For example: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform (that's what bzip does btw). In theory a compression system could use multiple such "tables" and try various configurations to find new redundancies, folding and re-folding the data again and again, but obviously I'm not going to ask you to invent such a system Cool

In any case, whatever degree of compression is achieved => is good. If a compressed block, for example, can be reduced from 1mb to 600kb, it's a huge improvement for someone with a slow connection.
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 03, 2016, 05:14:15 PM
 #10

The problem is that a lot of the data is high entropy and not compressible. I believe I have identified and extracted most of the redundancy. Compressing the dataset is a one-time operation, and it doesn't really take much time to decompress.

What seems incompressible can generally be compressed more if one shifts the order of characters around a file. For example: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform (that's what bzip does btw). In theory a compression system could use multiple such "tables" and try various configurations to find new redundancies, folding and re-folding the data again and again, but obviously I'm not going to ask you to invent such a system Cool

In any case, whatever degree of compression is achieved => is good. If a compressed block, for example, can be reduced from 1mb to 600kb, it's a huge improvement for someone with a slow connection.
If you can compress rmd160(sha256()) or sha256(sha256()) output by any more than a few percent, it probably proves they are not of cryptographic strength and are open to plaintext attacks, among others. SHA256 is supposed to be high entropy, right? If so, it isn't compressible.

Given this, what is left is removing the duplication, i.e. the numvouts references to an identical txid as they all get spent, and the identical rmd160[20] being referenced by the reused addresses, along with the pubkeys when signed.

So what I do is create a structured data set without the high-entropy parts and then compress just that part, achieving a 75% compression. Using parallel sync, the time to sync that on a 20 Mbps home connection would be less than 2 hours, but that leaves out the sigs, which double the size to about 30GB total. However, if we can skip all the sigs before block 290000, then probably 10GB+ of that is not needed.
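A sketch of the dedup step, assuming a simple table where each unique txid is stored once and every other reference becomes a 32-bit index (linear search just to keep the illustration short):

Code:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Illustrative dedup table: each unique 32-byte txid is stored exactly once in an
// append-only array of high-entropy data; every spend then refers to it by a 32-bit
// index. Linear search keeps the sketch short; a real version would use a hash table.
struct txid_table { uint8_t (*txids)[32]; uint32_t num, max; };

uint32_t txid_index(struct txid_table *t, const uint8_t txid[32])
{
    for (uint32_t i = 0; i < t->num; i++)
        if (memcmp(t->txids[i], txid, 32) == 0)
            return i;                                    // seen before: reuse its index
    if (t->num == t->max)                                // grow (error handling omitted)
    {
        t->max = t->max ? t->max * 2 : 1024;
        t->txids = realloc(t->txids, t->max * sizeof(*t->txids));
    }
    memcpy(t->txids[t->num], txid, 32);                  // store the 32 high-entropy bytes once
    return t->num++;                                     // every other copy becomes this 4-byte index
}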

On the /r/Bitcoin front page there is a mention of this with a lot of details on the internals.

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
watashi-kokoto
Sr. Member
****
Offline Offline

Activity: 682
Merit: 268



View Profile
March 07, 2016, 05:43:38 PM
 #11

Segwit prunes old signatures, and since the signatures are a major source of entropy, it may make the leftover data more compressible.

amaclin
Legendary
*
Offline Offline

Activity: 1260
Merit: 1019


View Profile
March 07, 2016, 06:38:40 PM
 #12

Segwit prunes old signatures, and since the signatures are a major source of entropy, it may make the leftover data more compressible.
What?  Grin
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 09, 2016, 08:28:24 PM
 #13

Segwit prunes old signatures, and since the signatures are a major source of entropy, it may make the leftover data more compressible.
What?  Grin
I create separate files that just contain the signatures, so pruning them is a matter of just deleting the files

and yes, the index data is compressible, about 2x

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
amaclin
Legendary
*
Offline Offline

Activity: 1260
Merit: 1019


View Profile
March 09, 2016, 09:10:54 PM
 #14

I create separate files that just contain the signatures, so pruning them is a matter of just deleting the files
and yes, the index data is compressible, about 2x
Why not delete all block files?  Grin
The algorithm would be:
1) check incoming transactions (wild and confirmed) for their validity, except for checking that the inputs are unspent or even exist
2) relay valid transactions to your peers
3) keep only the last 100 (200-500-2016) blocks of the blockchain on the hard drive

This would save 95% of hard disk space.
AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
March 09, 2016, 09:24:53 PM
 #15

Segwit prunes old signatures, and since the signatures are a major source of entropy, it may make the leftover data more compressible.
What?  Grin
I create separate files that just contain the signatures, so pruning them is a matter of just deleting the files

and yes, the index data is compressible, about 2x

What compression are you using?

LRZIP is *very* good for large files (the larger the file, the more redundancies it can find).

(A typical 130mb block.dat will be down to 97-98mb with gzip/bzip2, but can go under 90 with lrzip).
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 10, 2016, 12:04:10 AM
 #16

Segwit prunes old signatures, and since the signatures are a major source of entropy, it may make the leftover data more compressible.
What?  Grin
I create separate files that just contain the signatures, so pruning them is a matter of just deleting the files

and yes, the index data is compressible, about 2x

What compression are you using?

LRZIP is *very* good for large files (the larger the file, the more redundancies it can find).

(A typical 130mb block.dat will be down to 97-98mb with gzip/bzip2, but can go under 90 with lrzip).

I am just using mksquashfs, so I get a compressed read-only filesystem.
This protects all the data from tampering, so there is little need to reverify anything.
The default compresses the index data to about half, and the sigs data is in a separate directory now, so after initial validation it can just be deleted, unless you want to run a full relaying node.

I haven't gotten a complete run yet with the latest iteration of data, so I don't have exact sizes. I expect that the non-signature dataset will come in at less than 20GB for the full blockchain. The reason it gets such good compression is that most of the indices are small numbers that occur a lot, over and over. By mapping the high-entropy pubkeys, txids, etc. to a 32-bit index, not only is there an 8x compression, but the resulting index data is itself compressible, so probably about a 12x compression overall.

The vins are the bulkiest, but I encode them into a metascript as described in https://bitco.in/forum/threads/vinmetascript-encoding-03494921.942/

James

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 10, 2016, 12:05:45 AM
 #17

I create separate files that just contain the signatures, so pruning them is a matter of just deleting the files
and yes, the index data is compressible, about 2x
Why not delete all block files?  Grin
The algorithm would be:
1) check incoming transactions (wild and confirmed) for their validity, except for checking that the inputs are unspent or even exist
2) relay valid transactions to your peers
3) keep only the last 100 (200-500-2016) blocks of the blockchain on the hard drive

This would save 95% of hard disk space.

Sure, for a pruned node that is fine, but I want a full node with the smallest footprint.
But given time, I plan to support all the various modes.

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET
watashi-kokoto
Sr. Member
****
Offline Offline

Activity: 682
Merit: 268



View Profile
March 10, 2016, 02:30:33 PM
 #18

I'm looking forward to segwit so much. I tried segnet from GitHub but didn't observe anything.

It would be so amazing to try some altcoin with segwit, just to test it.
TierNolan
Legendary
*
Offline Offline

Activity: 1232
Merit: 1083


View Profile
March 10, 2016, 07:58:55 PM
 #19

I think we have to admit that a large part of the BTC blockchain has been deanonymized. And until new methods like CT are adopted, this will only get worse.

So rather than fight a losing battle, why not accept that there are people who don't care much about privacy and for whom convenience is more important. In fact they might be required to be public. The canonical encoding encodes the existing blockchain (and future public blockchains) much better than any other method, as it ends up as compact 32-bit numbers instead of a high-entropy 32-byte txid + vout. The savings is much more than 12 bytes; it takes only 6 bytes to encode an unspent, so closer to 30 bytes.

What exactly do you mean by "canonical encoding"?  What is the privacy loss here?

1LxbG5cKXzTwZg9mjL3gaRE835uNQEteWF
jl777 (OP)
Legendary
*
Offline Offline

Activity: 1176
Merit: 1132


View Profile WWW
March 10, 2016, 09:06:58 PM
 #20

I think we have to admit that a large part of the BTC blockchain has been deanonymized. And until new methods like CT are adopted, this will only get worse.

So rather than fight a losing battle, why not accept that there are people who don't care much about privacy and for whom convenience is more important. In fact they might be required to be public. The canonical encoding encodes the existing blockchain (and future public blockchains) much better than any other method, as it ends up as compact 32-bit numbers instead of a high-entropy 32-byte txid + vout. The savings is much more than 12 bytes; it takes only 6 bytes to encode an unspent, so closer to 30 bytes.

What exactly do you mean by "canonical encoding"?  What is the privacy loss here?
Canonical encoding means a numbering system for each block, tx, vin, and vout, so that the same number always references the same one. Since the blocks are ordered, the tx are ordered within each block, and the vins and vouts are ordered within each tx, this is a matter of just iterating through the blockchain in a deterministic way.
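As a sketch of that deterministic numbering, a single pass that advances one global counter per object type in block/tx order (structures invented for illustration):

Code:
#include <stdint.h>

// Invented structures: just enough to show the deterministic iteration order.
struct tx    { int numvins, numvouts; };
struct block { int numtx; struct tx *txs; };

// One global counter per object type; every node that walks the same chain in the
// same block/tx/vin/vout order assigns the same 32-bit index to the same object.
struct canonical_counters { uint32_t blockind, txind, vinind, voutind; };

void number_block(struct canonical_counters *c, const struct block *blk)
{
    c->blockind++;                              // this block's canonical index
    for (int t = 0; t < blk->numtx; t++)
    {
        c->txind++;                             // the txid <-> txind mapping is recorded here
        for (int v = 0; v < blk->txs[t].numvins; v++)
            c->vinind++;                        // each spend gets a global vin index
        for (int v = 0; v < blk->txs[t].numvouts; v++)
            c->voutind++;                       // each output gets a global unspent index
    }
}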

I have no idea how this would cause any privacy loss, as it is just using 32-bit integers as pointers to the hashes. The privacy issue was raised as somehow a reason not to use an efficient encoding.

James

Yes, I understand that if the blockchain reorgs, the 32-bit indexes will need to be recalculated for the affected ones, and orphans get no index.

http://www.digitalcatallaxy.com/report2015.html
100+ page annual report for SuperNET