Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: John Tobey on May 23, 2011, 07:05:33 AM



Title: [CORRECTED] Script calculates 71% freeable transactions
Post by: John Tobey on May 23, 2011, 07:05:33 AM
"nobody has done studies of the existing block chain to see how much space could be reclaimed."  --https://en.bitcoin.it/wiki/Scalability

So I gave it a go.  I hope someone will check my logic, especially as I am new to Python.

$ time src/bitcointools/freeable.py
loaded  125925  blocks
201085 of 544296 tx freeable (corrected: 387616 of 544296 tx)
63328760 of 168627001 tx bytes freeable (corrected: 123285900 of 168627001 tx bytes)

real   0m47.131s
user   0m43.799s
sys   0m2.744s
$ ls -l .bitcoin/blk0*.dat
-rw------- 1 jtobey jtobey 180119823 2011-05-23 00:42 .bitcoin/blk0001.dat

(edited: fixed bug http://forum.bitcoin.org/index.php?topic=9461.msg137059#msg137059 (http://forum.bitcoin.org/index.php?topic=9461.msg137059#msg137059)) diff of bitcointools:
Code:
diff --git a/deserialize.py b/deserialize.py
index fe0cb09..a67645b 100644
--- a/deserialize.py
+++ b/deserialize.py
@@ -75,6 +75,7 @@ def deserialize_TxOut(d, owner_keys=None):
 
 def parse_Transaction(vds):
   d = {}
+  start = vds.read_cursor  # XXX breaks BCDataStream abstraction, do we care?
   d['version'] = vds.read_int32()
   n_vin = vds.read_compact_size()
   d['txIn'] = []
@@ -85,6 +86,7 @@ def parse_Transaction(vds):
   for i in xrange(n_vout):
     d['txOut'].append(parse_TxOut(vds))
   d['lockTime'] = vds.read_uint32()
+  d['tx'] = vds.input[start:vds.read_cursor]
   return d
 def deserialize_Transaction(d, transaction_index=None, owner_keys=None):
   result = "%d tx in, %d out\n"%(len(d['txIn']), len(d['txOut']))
diff --git a/freeable.py b/freeable.py
new file mode 100755
index 0000000..c79f458
--- /dev/null
+++ b/freeable.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python
+#
+# Read the block database, find out how many transactions
+# could be purged and how many bytes they take up.
+# TODO: find out how many Merkle hashes could be purged.
+#
+from bsddb.db import *
+import logging
+import os
+import sys
+
+from BCDataStream import *
+from block import scan_blocks
+from deserialize import parse_Block
+from util import determine_db_dir, create_env
+import Crypto.Hash.SHA256 as SHA256
+
+import binascii  # debugging
+
+def main():
+  import optparse
+  parser = optparse.OptionParser(usage="%prog [options]")
+  parser.add_option("--datadir", dest="datadir", default=None,
+                    help="Look for files here (defaults to bitcoin default)")
+  (options, args) = parser.parse_args()
+
+  if options.datadir is None:
+    db_dir = determine_db_dir()
+  else:
+    db_dir = options.datadir
+
+  try:
+    db_env = create_env(db_dir)
+  except DBNoSuchFileError:
+    logging.error("Couldn't open " + db_dir)
+    sys.exit(1)
+
+  blockfile = open(os.path.join(db_dir, "blk%04d.dat"%(1,)), "rb")
+  block_datastream = BCDataStream()
+  block_datastream.map_file(blockfile, 0)
+
+  blocks = []
+  def gather_stats(block_data):
+    block_datastream.seek_file(block_data['nBlockPos'])
+    blocks.append(parse_Block(block_datastream))
+    return True
+
+  scan_blocks(db_dir, db_env, gather_stats)
+  blocks.reverse()
+  print 'loaded ', len(blocks), ' blocks'
+
+  tx = {}
+  total_tx = 0
+  freeable_tx = 0
+  total_bytes = 0
+  freeable_bytes = 0
+
+  for data in blocks:
+    coinbase = True
+    for txn in data['transactions']:
+      tx_hash = SHA256.new(SHA256.new(txn['tx']).digest()).digest()
+      #print '> ', binascii.hexlify(tx_hash)
+      tx_bytes = len(txn['tx'])
+      tx[tx_hash] = (tx_bytes, len(txn['txOut']))
+      total_tx += 1
+      total_bytes += tx_bytes
+
+      if coinbase:
+        coinbase = False
+      else:
+        for txin in txn['txIn']:
+          #print '< ', binascii.hexlify(txin['prevout_hash'])
+          (bytes, live) = tx[txin['prevout_hash']]
+          if live == 1:
+            freeable_bytes += bytes
+            freeable_tx += 1
+            del tx[txin['prevout_hash']]
+          else:
+            tx[txin['prevout_hash']] = (bytes, live - 1)
+
+  db_env.close()
+
+  print freeable_tx, 'of', total_tx, 'tx freeable'
+  print freeable_bytes, 'of', total_bytes, 'tx bytes freeable'
+
+if __name__ == '__main__':
+    main()


Title: Re: Script calculates freeable transactions
Post by: ByteCoin on May 23, 2011, 10:27:20 AM
201085 of 544296 tx freeable
63328760 of 168627001 tx bytes freeable


Good work! I hope I'll get around to checking it soon.

One complication - Although you can throw away the transactions, you have to keep the Merkle hash tree stubs so that's roughly 32 bytes per transaction (because the tree height is low) which you have to store.

ByteCoin


Title: Re: Script calculates freeable transactions
Post by: Mike Hearn on May 23, 2011, 11:33:26 AM
Great work! Simulating the merkle tree collapse is probably much harder, but even so the fact that 36% can be deleted today is great. I'd expect that number to only go up as the economy picks up pace.

Could you update the wiki page with your findings?


Title: Re: Script calculates freeable transactions
Post by: John Tobey on May 23, 2011, 02:15:36 PM
Great work! Simulating the merkle tree collapse is probably much harder, but even so the fact that 36% can be deleted today is great. I'd expect that number to only go up as the economy picks up pace.

Could you update the wiki page with your findings?

Updated: "As of May, 2011, the software does not implement pruning, and the savings are calculated {link here} at 37% of transactions or 35% of raw block bytes."

I will not have time to calculate the Merkle tree prunings.  I guess a lower bound is the transaction count times 32, and an upper bound is double that.  Adding the former number to the freeable tx bytes and dividing by the raw file size gives me 38.7%, which is slightly higher than the rate in both transaction count and transaction bytes.  This seems wrong to me, but I haven't given it serious thought.


Title: Re: Script calculates freeable transactions
Post by: John Tobey on May 23, 2011, 02:32:09 PM
This seems wrong to me, but I haven't given it serious thought.

OOH!  Bug fix.  "tx[txin['prevout_n']]" should be "tx[txin['prevout_hash']]".  This improves the stats quite a bit:

loaded  125925  blocks
387616 of 544296 tx freeable   (71.2%)
123285900 of 168627001 tx bytes freeable  (73.1%)

(387616*32+123285900)/180119823 = 75.3%, still too high I suspect.

Updating wiki...


Title: Re: [CORRECTED] Script calculates 71% freeable transactions
Post by: theymos on May 24, 2011, 02:43:44 AM
I get similar numbers using BBE data. 391288 transactions and 124638004 bytes over 126191 blocks.


Title: Re: [CORRECTED] Script calculates 71% freeable transactions
Post by: John Tobey on May 24, 2011, 01:25:48 PM
I get similar numbers using BBE data. 391288 transactions and 124638004 bytes over 126191 blocks.

Thanks for checking.