Bitcoin Forum
May 08, 2024, 07:04:26 PM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Poll
Question: How would you like the data to be processed?
Download a complete dump then process - 8 (80%)
Process during download - 2 (20%)
Total Voters: 10

Pages: [1] 2 3 4 5 6 »  All
  Print  
Author Topic: Discussion for MtGox trade data downloader  (Read 14338 times)
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
May 31, 2013, 01:27:36 PM
 #1

Hello,

As you may know, I am developing a tool to download data from MtGox's bigquery database (see here for info on this db). The tool is being written in python, but hopefully I will be able to also release it as a multi-platform self-contained app for non-programmers as well. I also plan on supporting several different formats, e.g. phantomcircuit's sqlite, csv, full sqlite dump, among others (it will also be easy for anyone who can program to implement their own).

The problem is that bigquery can be quite slow to sort this dataset, and downloading (and maintaining) the data using these ordered queries can quickly use up the free bigquery limits, so it is only really feasible to obtain the raw table data. This is good because it's quick and uses up 0 bytes of processing quota, however it also comes in unordered. For database formats this doesn't matter, but for formats like CSV, it's not really desirable to have records all over the place.

  • To solve this, I propose that the tool will download a full and complete sqlite dump of the data, which can then be used on its own or transcribed to other formats (in order), e.g. CSV, phantomcircuit's format, etc. Unfortunately, however, this will result in more memory usage as you will probably have at least 2 copies of the data. I estimate that the full dump will be about 460 mb, and other formats will depend on what data is  included and in what format. This does have the benefit, however, that you will always have an up-to-date copy of the bq database, and so can generate whatever format you need, even if you hadn't anticipated it.
  • The alternative is to process it while it downloads, however this will mean that you will be locked into the format you specify, and it will be in the order of the bigquery database, so the CSV format, for example, would be difficult to make effective use of.

As you can see, there are two options with both advantages and disadvantages. Please select which option you would prefer in the poll above. It would also be good to hear your thoughts/discussion on this tool.

In the meantime, you can download the partially complete tool which is capable of downloading a complete dump to a sqlite3 database here: https://bitbucket.org/nitrous/bq. Its dependencies are my sqlite class (pykrete, here) and the google python client library here

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
1715195066
Hero Member
*
Offline Offline

Posts: 1715195066

View Profile Personal Message (Offline)

Ignore
1715195066
Reply with quote  #2

1715195066
Report to moderator
1715195066
Hero Member
*
Offline Offline

Posts: 1715195066

View Profile Personal Message (Offline)

Ignore
1715195066
Reply with quote  #2

1715195066
Report to moderator
1715195066
Hero Member
*
Offline Offline

Posts: 1715195066

View Profile Personal Message (Offline)

Ignore
1715195066
Reply with quote  #2

1715195066
Report to moderator
BitcoinCleanup.com: Learn why Bitcoin isn't bad for the environment
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1715195066
Hero Member
*
Offline Offline

Posts: 1715195066

View Profile Personal Message (Offline)

Ignore
1715195066
Reply with quote  #2

1715195066
Report to moderator
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
May 31, 2013, 03:14:30 PM
 #2

N.B. On my ~8meg connection, the full download took around 2.5 hours, and resulted in a 457mb db

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
Loozik
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250


Born to chew bubble gum and kick ass


View Profile
May 31, 2013, 09:13:54 PM
 #3

I vote for ''download a complete dump then process''

Thanks
whydifficult
Sr. Member
****
Offline Offline

Activity: 287
Merit: 250



View Profile WWW
June 01, 2013, 09:33:06 PM
 #4

Awesome work.

I am currently doing some research on how to get historical data from different exchanges to enable backtesting features. I want to use your tool to download everything into a SQLite db so that my bot can read this data. However I am running into trouble installing your tool:

  • I have installed python 2.7 (requirement of Google API client library)
  • I have downloaded the Google API client library and updated the clients_secrets and its in folder A
  • I have downloaded bq and pykrete in folder B and C

Do I need to point bq to pykrete and the Google library or do I need to put certain files at certain places? Maybe there is an installation doc I missed?

Gekko a nodejs bitcoin trading bot!
Realtime Bitcoin Globe - visualizing all transactions and blocks
Tip jar (BTC): 1KyQdQ9ctjCrGjGRCWSBhPKcj5omy4gv5S
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 01, 2013, 09:51:31 PM
Last edit: June 01, 2013, 11:05:34 PM by nitrous
 #5

Awesome work.

I am currently doing some research on how to get historical data from different exchanges to enable backtesting features. I want to use your tool to download everything into a SQLite db so that my bot can read this data. However I am running into trouble installing your tool:

  • I have installed python 2.7 (requirement of Google API client library)
  • I have downloaded the Google API client library and updated the clients_secrets and its in folder A
  • I have downloaded bq and pykrete in folder B and C

Do I need to point bq to pykrete and the Google library or do I need to put certain files at certain places? Maybe there is an installation doc I missed?

Sorry I wasn't more clear on this. To install a python package, you need to run
Code:
sudo python ./setup.py install
from the directory using the console/terminal/command prompt. I'm assuming you're on linux or mac here, if you're on windows omit the `sudo` part. Once you've installed pykrete and the google library, you should then be able to run mtgox.py and it should start the download Smiley It should also be able to resume previous downloads if you cancel it.

EDIT: When the tool is more complete and is ready to be packaged up into a self-contained GUI app, all these dependency issues should be gone as they will be built-in Smiley My first quick attempt to do this using pyinstaller didn't work, which is slightly ominous, but hopefully other tools such as py2exe and py2app will be able to handle the google library properly.

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 01, 2013, 10:17:41 PM
 #6

After talking with Jordan Tigani from Google's bigquery project, it seems that this tool may stop working at some point after MtGox starts regular updates, because Google occasionally performs coalesce operations that don't respect the order of rows in a table. The initial download should always work properly, but if a coalesce occurs, then future updates may corrupt the local dump. Bigqeury do have plans to implement some way of respecting order, but it is not yet on the tables. Perhaps if we made a case to Google about the importance of this we could reach some kind of solution? Otherwise it will end up costing around $60/month to keep local copies regularly updated because `order by` queries will need to be run for each update instead of a simple (and free) tabledata:list. Running SQL queries would also slow down the tool speed.

The only other option I can envisage is someone performing a `SELECT * FROM [mt-gox:mtgox.trades] ORDER BY Money_Trade__ ASC` query every time the database is updated, saving it to a table, and then the tool would download from this table instead. If bigquery doesn't implement some kind of row order permanence, then I'll ask MagicalTux to consider doing this officially.

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 01, 2013, 11:00:04 PM
 #7

Maybe there is an installation doc I missed?

Hi again, I just updated the pykrete docs to be a lot more complete, and they now include some simple installation instructions Smiley

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
Loozik
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250


Born to chew bubble gum and kick ass


View Profile
June 01, 2013, 11:24:35 PM
 #8

After talking with Jordan Tigani from Google's bigquery project, it seems that this tool may stop working at some point after MtGox starts regular updates, because Google occasionally performs coalesce operations that don't respect the order of rows in a table.

I read your discussion with Jordan Tigani (I do not understand it very well)

Quote from: Jordan Tigani
Coalesce happens on the order of once every 300 times you append data to the table.

Quote from: nitrous
The database is being maintained by a financial company, and they plan to start updating it automatically (up to every 10mins).

and have one comment: if the acceleration rate of number of transaction (ticks) is maintained (you know, more people will perform transactions more frequently) one can expect that in a few years 300 appendices will be made within one second. A year ago millisecond timestamping was a standard for reporting ticks. At the moment microsecond timestamping (one millionth of a second) is being introduced by financial data vendors and exchanges. I do not think 10 mins update is sustainable in the long run.

Otherwise it will end up costing around $60/month to keep local copies regularly updated because `order by` queries will need to be run for each update instead of a simple (and free) tabledata:list. Running SQL queries would also slow down the tool speed.

Is $60/month the cost to be incurred by MtGox / you or by every single user of this service?

If bigquery doesn't implement some kind of row order permanence, then I'll ask MagicalTux to consider doing this officially.

Couldn't MtGox just invest in servers / rent a data centre and pay a programmer like you to make a dedicated service / tailor made service instead of spending many hours talking to Google (without a guarantee for a success) - time is money.

Then MtGox can simply provide data API service to customers who will pay e.g. BTC 0.5 per month - it is a standard practice in financial industry that data API is provided to users (i) for a fee or (ii) in exchange for maintaining a certain balance in the account (e.g. if you want API from Dukascopy forex broker you must have a balance of at least $100k)
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 02, 2013, 06:47:13 AM
 #9

Hi Loozik - the bq database is not updated every time a transaction occurs. Instead, MtGox will upload all new trades that occurred between the last update and then, about every 10mins-1hour. If there are trades occurring 300 times per second, then these updates would insert each of the new 180k-1.8m trades, so there's no problem there. Also note that MtGox does use microsecond timestamps (see the Money_Trade__ field).

If there is no resolution to the problem, then up to $60/month would be incurred by the users of the service, but this is only if the user were to update their local dump every 10 minutes. If they did so every 4 hours or less, it would be within the Google bq monthly quota, and would be free. I understand the usual forex policy for API access, but I'd like to avoid any changes that would charge for access.

There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
Loozik
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250


Born to chew bubble gum and kick ass


View Profile
June 02, 2013, 06:56:55 AM
 #10

There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.

Can you share your thoughts on possible solutions or is it too early?
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 02, 2013, 10:07:52 AM
Last edit: June 02, 2013, 11:50:15 AM by nitrous
 #11

There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.

Can you share your thoughts on possible solutions or is it too early?

Update: added fifth option (see the second row, using destination table)
Update 2: assuming that option 2 is feasible, MagicalTux has confirmed he will implement it Smiley

Possible solutionAssumptionsProsCons
There are two tables, trades1 and trades2. When it is time to update, MtGox appends to trades1, then copies this table to trades2. Next time, they append to trades2 and copy to trades1. Both tables are identical, and the tool can download from trades1 every 10 minutesThat copy operations preserve row order and prevent coalesce operationsFree, simple, quick, tool continues to work as it does now -- basically the ideal solutionNone
There are two tables, trades and trades_sorted. When it is time to update, MtGox appends to trades then performs a "SELECT * FROM trades ORDER BY Money_Trade__ ASC" using trades_sorted as the destination table (and using WRITE_TRUNCATE mode)That the sorted table will be guaranteed to remain sorted, and won't be coalescedTool continues to work as usual (switched to download from trades_sorted)MtGox must pay up to $60/month to implement this
Hybrid solution. The tool checks whether the trades table has been coalesced since last updated. If yes, then it downloads by SQL query, otherwise it downloads using the normal tabledata:list methodThat it is possible to check for coalescence, that checking is freeSince coalescence only occurs about every 2 days at most frequent, the usage should remain free, downloading from a query operation could be slightly quickerMore complex tool (only slightly though), query operations are slow (can take ~100s to sort the entire table)
Google makes the coalesce operation order-preservingThat changing the coalesce behaviour is easy and quickNo changes to toolAssuming that Google will change the way their service works just for our benefit
Google implements automatic table sorting for tables, and MtGox selects "Money_Trade__ ASC, Primary DESC" as the sorting orderThat Google would implement an entirely new functionalityNo changes to the tool, functionality could be useful for many people and other bigquery usersA lot to ask from Google, moreso than the last solution

As you can see, the first two rely on possible pre-existing behaviour, and the last two rely on Google implementing new behaviour, so the first two are more ideal. Unfortunately, however, Jordan and my timezones intersect at awkward times so correspondence can be a bit slow.



Update to the development:

The basic underlying tool mechanism is pretty ready, todos are:
  • Make the tool more reliable and not at risk of corruption should bq return unexpected results
  • Create a graphical interface
  • Create documentation (if necessary)
  • Test out different python packagers for creating apps/exes to see which works with the necessary dependencies, and release self-contained apps for different platforms (windows, mac), as well as an archived distribution for linux/mac
  • Take feedback and make any necessary improvements

I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort. Any python programmers though can feel free to fork my repo and work on these if you want Smiley

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
Loozik
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250


Born to chew bubble gum and kick ass


View Profile
June 02, 2013, 02:27:35 PM
 #12

I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort.

Enjoy your family holiday  Smiley
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 03, 2013, 07:22:21 PM
Last edit: June 04, 2013, 10:05:48 AM by nitrous
 #13

I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort.

Enjoy your family holiday  Smiley

Thank you Smiley



I just heard back from Jordan that copying does respect the order, so option 1 should definitely be plausible. Creating a table in a single operation should also result in stable order access, so option 2 is also possible. The only uncertainty I have is whether the table needs to be deleted and recreated to 'reset' the coalesce operation, or whether just using a WRITE_TRUNCATE will be sufficient. Either way we should be able to do this quite easily now Smiley

Update: Perfect! WRITE_TRUNCATE will be sufficient, so it should be quite easy to get this going.

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
8bitPunk
Member
**
Offline Offline

Activity: 70
Merit: 10



View Profile
June 11, 2013, 10:27:13 AM
 #14

Nitrous & MagicalTux - I really appreciate your effort in getting this data source available on big query.

I ran a query to pull all trades since 1 Jan 2013 and it looks complete up until 23 May 2013. However, I've also observed some gaps in the data where periods of hours or days do not have any trades (eg between 1367430878865520 & 1367547410438010). Yet there are 60K trades during that period in the data source, so my code has tripped up somewhere.

I just wanted to mention it in case it relates to the WRITE_TRUNCATE method to derive the sorted table. FWIW I queried the trades_raw table as a trades_sorted table wasn't available yet. I imagine that once you do have the sorted table then trades since 23 May will start being uploaded?

In the meantime I will use the python tool from Nitrous to get the complete dataset.

BTC 18bPunkuginRBm1Xz9mcgj8mWJnHDAW5Th | Ł LTCgXEdyBdoQ9WdF6JHi7Pa2EWtzbDjG76 | Ψ ATEBiTLkLpAYeW5hQknUfSvnb7Abbgegku
whydifficult
Sr. Member
****
Offline Offline

Activity: 287
Merit: 250



View Profile WWW
June 21, 2013, 05:27:44 PM
Last edit: June 21, 2013, 07:34:47 PM by whydifficult
 #15

Sorry for hijacking your topic but people using this tool might find this handy:

Using the database downloaded by the trade data downloader I wrote a tiny candleCalculator which calculates candles based on the trades in this database. I already calculated all hourly candles from Jun 26 2011 19:00:00 up to May 23 2013 16:00:00 (for BTCUSD, 16,699 candles), you can download them here.

If anyone wants to calculate their own candles check out the thread for the script.

(it's a small and easy script but you could always port it to python to make it part of the trade data downloader.)

Gekko a nodejs bitcoin trading bot!
Realtime Bitcoin Globe - visualizing all transactions and blocks
Tip jar (BTC): 1KyQdQ9ctjCrGjGRCWSBhPKcj5omy4gv5S
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
June 21, 2013, 05:47:19 PM
 #16

Sorry for hijacking your topic but people using this tool might find this handy:

Using the database downloaded by the trade data downloader I wrote a tiny candleCalculator which calculates candles based on the trades in this database. I already calculated all hourly candles from Jun 26 2011 19:00:00 up to May 23 2013 16:00:00 (for BTCUSD, 16,699 candles), you can download them here.

If anyone wants to calculate their own candles check out the thread for the script.

(it's a small and easy script but you could always port it to python to make it part of the trade data downloader.)

Thank you, that's great work Smiley I'll look into porting it to python and including it in the downloader soon if that's ok with you.

Just an update for anyone interested in this project: I've just finished two weeks of exams, and I'm now going on holiday, but I'll be back by the beginning of July and then I'll be able to put in some more work on this, thank you for your patience.

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
bitsalame
Donator
Hero Member
*
Offline Offline

Activity: 714
Merit: 510


Preaching the gospel of Satoshi


View Profile
June 26, 2013, 05:31:08 AM
 #17

we'll be going with solution #2. It seems the most reliable option Smiley

Sorry to hijack this thread but:
WHEN ON EARTH WILL YOU FIX THE ANDROID APP?
nitrous (OP)
Sr. Member
****
Offline Offline

Activity: 246
Merit: 250


View Profile
July 02, 2013, 07:05:01 PM
Last edit: September 13, 2013, 11:53:48 AM by nitrous
 #18

Hi everyone,

While I was away SerialVelocity helped to develop the start of a GUI, and I've now filled in some basic update functionality Smiley Here are some screenshots of where we're at now:



There is still much to do, however:
  • Custom dump location
  • Dump statistics
  • Graphical error handling
  • Exporting the data to different formats
  • Choosing what data to export
  • Saving export preferences so that exports can be routinely updated
  • Testing on multiple platforms
  • Packaging the app into a self-contained bundle

As you can see, there's a fair bit to do, but hopefully it shouldn't take too long.

Thanks,
nitrous

EDIT: To use this, run:
Code:
python gui.pyw

Donations: 1Q2EN7TzJ6z82xvmQrRoQoMf3Tf4rMCyvL
MtGox API v2 Unofficial Documentation: https://bitbucket.org/nitrous/mtgox-api/overview
MtGox API v2 Unofficial Documentation Forum Thread: https://bitcointalk.org/index.php?topic=164404.0
whydifficult
Sr. Member
****
Offline Offline

Activity: 287
Merit: 250



View Profile WWW
July 02, 2013, 07:20:33 PM
 #19

Looks great, will test it out tonight!

Thank you, that's great work Smiley I'll look into porting it to python and including it in the downloader soon if that's ok with you.

Yes of course! It would be much better to be part of your tool instead of being a separate script (relying on a separate dev environment). It is a pretty simple script but if you need help porting it just let me know. I would have done it myself if I was able to code something in Python (learning it is on my list).

--

Thanks for your contributions to the community Smiley We definitely need more initiatives like this.

Gekko a nodejs bitcoin trading bot!
Realtime Bitcoin Globe - visualizing all transactions and blocks
Tip jar (BTC): 1KyQdQ9ctjCrGjGRCWSBhPKcj5omy4gv5S
Loozik
Sr. Member
****
Offline Offline

Activity: 378
Merit: 250


Born to chew bubble gum and kick ass


View Profile
July 02, 2013, 07:32:42 PM
 #20

Thanks for updating.
Pages: [1] 2 3 4 5 6 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!