Bitcoin Forum
April 28, 2024, 08:28:53 AM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: [1]
  Print  
Author Topic: gz.blockchair.com data dumps - better way to store it?  (Read 81 times)
NotATether (OP)
Legendary
*
Offline Offline

Activity: 1582
Merit: 6696


bitcoincleanup.com / bitmixlist.org


View Profile WWW
June 27, 2023, 11:11:59 AM
Merited by ABCbits (1)
 #1

I've just did a crawl through Blockchair's data dump repository at, https://gz.blockchair.com, the total size of all the data on the site, including blocks, txs, outputs, inputs, etc. from all the chains is (as of today) about 2.7 terabytes. Here is the command I used to measure it in bytes: wget --mirror --no-host-directories -e robots=off --reject html  -l 0 --spider https://gz.blockchair.com 2>&1 | grep -E -o 'Length: [0-9]+' | awk '{sum += $2} END {print sum}' it only takes a few hours to run.

It seems to be a better alternative to using the Blockchair API proper, which seems to just randomly ban IP addresses without a paid API key.

Now since the data files are all in CSV format, just with tabs separated by spaces, I was wondering what is the best way to compress all this data, per chain at least? I know that CSV is a very inefficient representation as there's already megabytes of TAB characters, and there's no reason to store those either, so it's not like just compressing this with XZ or LZMA is the best solution.

Nevertheless it looks like all this stuff can be distributed at a reasonable size via Bittorrent - and even can be used to accelerate crypto applications so that they just need to fetch today's data online - if the dumps are compressed enough (per chain - don't want to mix up chain stuff). I would've liked to try it myself, but unfortunately this project needs 2x the disk space I have available right now.

.
.BLACKJACK ♠ FUN.
█████████
██████████████
████████████
█████████████████
████████████████▄▄
░█████████████▀░▀▀
██████████████████
░██████████████
████████████████
░██████████████
████████████
███████████████░██
██████████
CRYPTO CASINO &
SPORTS BETTING
▄▄███████▄▄
▄███████████████▄
███████████████████
█████████████████████
███████████████████████
█████████████████████████
█████████████████████████
█████████████████████████
███████████████████████
█████████████████████
███████████████████
▀███████████████▀
█████████
.
If you see garbage posts (off-topic, trolling, spam, no point, etc.), use the "report to moderator" links. All reports are investigated, though you will rarely be contacted about your reports.
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1714292933
Hero Member
*
Offline Offline

Posts: 1714292933

View Profile Personal Message (Offline)

Ignore
1714292933
Reply with quote  #2

1714292933
Report to moderator
1714292933
Hero Member
*
Offline Offline

Posts: 1714292933

View Profile Personal Message (Offline)

Ignore
1714292933
Reply with quote  #2

1714292933
Report to moderator
1714292933
Hero Member
*
Offline Offline

Posts: 1714292933

View Profile Personal Message (Offline)

Ignore
1714292933
Reply with quote  #2

1714292933
Report to moderator
witcher_sense
Legendary
*
Offline Offline

Activity: 2310
Merit: 4313

🔐BitcoinMessage.Tools🔑


View Profile WWW
June 28, 2023, 10:07:59 AM
Merited by ABCbits (1)
 #2

Use Apache Parquet instead of .csv or .tsv files: https://www.databricks.com/glossary/what-is-parquet

Quote
Characteristics of Parquet

Free and open source file format.
Language agnostic.
Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
Highly efficient data compression and decompression.
Supports complex data types and advanced nested data structures.

Here is a python script to convert tsv to parquet with pandas: https://stackoverflow.com/questions/26124417/how-to-convert-a-csv-file-to-parquet

█▀▀▀











█▄▄▄
▀▀▀▀▀▀▀▀▀▀▀
e
▄▄▄▄▄▄▄▄▄▄▄
█████████████
████████████▄███
██▐███████▄█████▀
█████████▄████▀
███▐████▄███▀
████▐██████▀
█████▀█████
███████████▄
████████████▄
██▄█████▀█████▄
▄█████████▀█████▀
███████████▀██▀
████▀█████████
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
c.h.
▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
▀▀▀█











▄▄▄█
▄██████▄▄▄
█████████████▄▄
███████████████
███████████████
███████████████
███████████████
███░░█████████
███▌▐█████████
█████████████
███████████▀
██████████▀
████████▀
▀██▀▀
Pages: [1]
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!