LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
February 22, 2020, 11:08:03 AM |
|
If it is not a secret, how much data space is needed for all those millions of posts?
I'm currently using 54 GB for loyce.club, and store 4.2 million files.
And is there a way to use some compression?
I mainly store HTML-files, so indeed, it would be great if a webbrowser would just be able to use index.html.gz to largely reduce the disk space consumption, but I just tested it and my browser doesn't get it.
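For a rough idea of how much gzip could save on stored HTML, here is a minimal Python sketch using only the standard-library gzip module. The sample page is made up for illustration, so the real ratio for archived posts will differ. As far as I know, a browser only unpacks gzip transparently when the web server sends the file with a Content-Encoding: gzip header, which is likely why opening index.html.gz directly doesn't work.

```python
import gzip

def gzip_ratio(html: str) -> float:
    """Return compressed size divided by original size for one HTML document."""
    raw = html.encode("utf-8")
    packed = gzip.compress(raw, compresslevel=9)
    return len(packed) / len(raw)

# Made-up sample page (an assumption, not real archive data);
# repetitive markup like forum HTML tends to compress well.
sample = (
    "<html><body>"
    + '<div class="post">Hello Bitcointalk</div>\n' * 500
    + "</body></html>"
)
ratio = gzip_ratio(sample)
print(f"compressed size is {ratio:.1%} of the original")
```

Running this on a handful of real files from one directory would give a more honest estimate than the synthetic sample.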
Due to my lack of time it took longer than I wanted, but I now added live updates for posts per user and per topic: Viewing unedited/deleted posts
How to use it
- Find the msgID, userID or topicID you need. Let's use msgID 51902990.
- Remove the last 4 digits from the msgID to get the directory name (if there are 4 digits or fewer, use 0): 5190.
- Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
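The steps above can be sketched in Python as follows. This is only an illustration: the function name and parameters are my own, and the only sections this thread shows in a URL are "posts" and "topics" (the directory name for userIDs isn't given, so it is left out).

```python
BASE_URL = "http://loyce.club/archive"

def archive_url(item_id: int, section: str = "posts") -> str:
    """Build the archive URL for a msgID ("posts") or a topicID ("topics")."""
    # Dropping the last 4 digits is integer division by 10,000;
    # IDs with 4 digits or fewer end up in directory 0.
    directory = item_id // 10_000
    return f"{BASE_URL}/{section}/{directory}/{item_id}.html"

print(archive_url(51902990))
print(archive_url(5229466, "topics"))
```

The two calls reproduce the example URLs given in this thread for msgID 51902990 and topicID 5229466.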
Details
- Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML-files. Use CTRL-F to find what you're looking for.
- I don't scrape hidden boards (such as Investigations).
- I don't keep post titles.
- I save raw HTML, including quotes.
- If I run out of disk space, I might create compressed archives per 10,000 posts.
- Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
- My current (sponsored) webhost has enough storage space for years to come.
- All scrape-times use Amsterdam time (CET).
- Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.
Examples
|
|
|
~DefaultTrust
Copper Member
Sr. Member
Offline
Activity: 1554
Merit: 489
Stop the war!
|
|
February 22, 2020, 01:22:26 PM |
|
I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above. 38 million left.
How fast does your parsing work?
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
February 22, 2020, 01:52:01 PM |
|
How fast does your parsing work? See: I expect to complete this around August.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 02, 2020, 05:16:29 PM |
|
How to use it
- Find the msgID, userID or topicID you need. Let's use msgID 51902990.
- Remove the last 4 digits from the msgID to get the directory name (if there are 4 digits or fewer, use 0): 5190.
- Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
This is an example of how I use it in practice: I copy the topicID (5229466), then type "topics" on my URL-bar. My browser suggests http://loyce.club/archive/topics/, which I select. Then, I paste the topicID, hit Backspace 4 times, type "/", paste the topicID again, type ".html" and hit enter. It takes some getting used to, but in just 5 seconds I have the page I was looking for: http://loyce.club/archive/topics/522/5229466.html.
|
|
|
|
alani123
Legendary
Offline
Activity: 2394
Merit: 1419
Leading Crypto Sports Betting & Casino Platform
|
|
March 05, 2020, 02:10:01 PM |
|
I see here that a thread was captured with just the error message for wrong BBcode: http://loyce.club/archive/posts/5395/53951381.html
Any way to include the content in spite of wrong BBcode?
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 05, 2020, 05:20:16 PM |
|
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
|
|
|
|
hosseinimr93
Legendary
Offline
Activity: 2394
Merit: 5237
|
|
March 05, 2020, 10:00:50 PM |
|
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
Yes, that's an error (maybe a bug) from the forum. That archived post was like the following post: https://bitcointalk.org/index.php?topic=4337249.0
Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
|
|
|
|
|
alani123
Legendary
Offline
Activity: 2394
Merit: 1419
Leading Crypto Sports Betting & Casino Platform
|
|
March 05, 2020, 10:36:30 PM |
|
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
I should have guessed this. It must be the easiest way to scrape anyway...
Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway. Contents of posts that are otherwise invisible because they include a table with broken tags are accessible to any forum member able to quote the post, but invisible in the eyes of robots. I don't see any utility for any poster to do this to their posts intentionally. If they can edit their threads, the contents could be replaced with something like a dot and be done with it. But it could be that a few thousand such posts exist. Google gives 3100 results when you search for ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org), some duplicates and some coming from signatures of course. That drops to 2550 results if you remove two users that came up with broken signatures ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org -Gamesbuy -trinaldao).
Now, I'm stepping into the territory of a sub-case in a sub-case, but if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing? Probably not. But strictly speaking it should be done if you'd want to grab everything that's available.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 06, 2020, 07:49:56 AM |
|
Maybe it hits the 64 kB limit in HTML; the Russian characters take a lot more space that way. I'm not sure if that's the limit though, I've made posts that take 80 kB when scraped.
Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.
There are more bugs in SMF that cause the preview to show differently than the real post.
if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing?
I only want to archive what the forum shows as public information.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 08, 2020, 10:17:54 AM |
|
Update: I now have the first 11 million posts scraped! At the moment, the first 6.1 million are available on loyce.club; processing all data takes approximately 3 days. At the current rate, I'm on schedule to complete archiving all posts around August.
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.
An update: I have started this project! Measured in scraping time, it's the biggest project I ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August. There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.
Sneak preview: http://loyce.club/archive/oldposts/
How to use:
- Find the msgID you need. Let's use 28228
- Remove the last 5 digits from the msgID to get the directory name (if there are 5 digits or fewer, use 0): 0
- Replace the last 2 digits of the msgID by xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
- Add "#msg" and the msgID: #msg28228
- Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
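As a sketch, the steps for the old-posts archive could look like this in Python. The function name is my own. The "0xx" rule for msgIDs with fewer than 5 digits is taken literally from the steps above; I haven't checked it against the actual files, so treat that branch as an assumption.

```python
OLDPOSTS_URL = "http://loyce.club/archive/oldposts"

def oldposts_url(msg_id: int) -> str:
    """Build the old-posts archive URL, including the #msg anchor."""
    # Dropping the last 5 digits gives the directory (0 for short IDs).
    directory = msg_id // 100_000
    if msg_id < 10_000:
        # Fewer than 5 digits: the steps say to use "0xx" (taken literally).
        page = "0xx.html"
    else:
        # Replace the last 2 digits by "xx".
        page = f"{msg_id // 100}xx.html"
    return f"{OLDPOSTS_URL}/{directory}/{page}#msg{msg_id}"

print(oldposts_url(28228))
```

For msgID 28228 this gives the same URL as the last step above.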
Limitations
- Currently, the first 2.1 million posts are available.
- I'll scrape the first 5.21 million topics and all posts in there.
- That means I'll archive 53.36 million posts; this partially overlaps with my scraper for new posts.
- This is a one-time thing; I won't update it with newer posts (I scrape unedited versions for those).
- The time "scraped on" is Amsterdam time.
If no username is mentioned, it's either "Anonymous" or "random". I forgot those exist when I started scraping, and it's not important enough to start over.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 14, 2020, 07:38:05 AM |
|
Update: I now have the first 11 million posts scraped!
It took longer than expected (real life has been very busy lately), but it's done: I now have posts up to April 6, 2015 archived. Out of these 11 million first posts, 1,520,880 (13.8%) are Deleted or Off-limits (most likely deleted).
|
|
|
|
nutildah
Legendary
Offline
Activity: 2982
Merit: 7986
|
|
March 24, 2020, 05:03:39 AM |
|
Update: I now have the first 11 million posts scraped! It took longer than expected (real life has been very busy lately), but it's done: I now have posts up to April 6, 2015 archived. Out of these 11 million first posts, 1,520,880 (13.8%) are Deleted or Off-limits (most likely deleted).
Hi Loyce, by any chance did you get around to archiving posts after April 2015 and up to where you started in July 2019? I'm looking for a post from July 2018. Thanks.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 24, 2020, 07:20:47 AM |
|
I'm looking for a post from July 2018.
I started scraping 2 months later (I haven't published those posts online). My older post scraping project scrapes one thread at a time, so if the post was made in an old topic and only deleted recently, I might have it. But that's not very likely.
|
|
|
|
nutildah
Legendary
Offline
Activity: 2982
Merit: 7986
|
|
March 24, 2020, 07:25:41 AM |
|
I'm looking for a post from July 2018. I started scraping 2 months later (I haven't published those posts online). My older post scraping project scrapes one thread at a time, so if the post was made in an old topic and only deleted recently, I might have it. But that's not very likely.
That's alright, I managed to find a copy of it elsewhere. Thanks for the info.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
March 25, 2020, 10:10:39 AM Last edit: March 26, 2020, 12:36:18 PM by LoyceV |
|
Notification: On many of those pages, the topic title was missing (due to an error). I've temporarily renamed the current version to http://loyce.club/archive/topics.old____fixing_errors_in_some_of_the_titles/. If you're looking for this data, use this, but please don't post links to that URL. Once the update is done, I'll remove this link. The normal location, http://loyce.club/archive/topics/, has incomplete data at the moment.
Update: done! The normal link works again.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
April 01, 2020, 08:14:00 AM Last edit: April 01, 2020, 08:52:38 AM by LoyceV |
|
Update: I now have the first 11 million posts scraped!
Some of today's weird usernames will most likely end up in my files forever! I scrape topics, and get the username from the post, not from the profile. Oh well...
Update: Usernames aren't affected without logging in. I'm quite happy with that, as most of my scraping doesn't use an account.
|
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
April 07, 2020, 11:02:45 AM |
|
Probably, it's part of this: I also have older posts: I've saved (most) unedited posts (6.2 million posts) since September 12, 2018, until the start of this topic. This data has not been added to this topic, and I can't really add it because I tried to remove quotes and that has some bugs. You can request to dig up unedited data when needed. But that's currently stored in large compressed files. Can you tell me what exactly you're looking for? I can probably dig up all posts made in that topic (without quotes and the above mentioned bugs), but it might be easier if you tell me what you're looking for.
|
|
|
|
cheater detector
Member
Offline
Activity: 213
Merit: 53
|
|
April 07, 2020, 11:16:41 AM |
|
1. Shiversnow
BTC Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd
2. Ingoats
BTC Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi
I just need both of these posts to make sure they applied to that campaign. When I use Google search I can find them, but I couldn't find them on the forum.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3304
Merit: 16644
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
April 07, 2020, 04:26:48 PM |
|
1. Shiversnow
BTC Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd
2. Ingoats
BTC Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi
I just need both of these posts to make sure they applied to that campaign. When I use Google search I can find them, but I couldn't find them on the forum.
I searched starting at post 50385301 (March 30, 2019) until post 50844598 (May 1, 2019). Out of those 459298 posts, I have saved 431578. I must have experienced some downtime.
I found "12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd" in only this (unedited) post:
Shiversnow 972438 50393313 <a href="https://bitcointalk.org/index.php#3">Economy</a> / <a href="https://bitcointalk.org/index.php?board=52.0">Services</a> / <b><a href="https://bitcointalk.org/index.php?topic=5126495.msg50393313#msg50393313">Re: [OPEN] [SIGNATURE CAMPAIGN] BLOCKCHAIN ZERO TO ONE </a></b>
#Proof of Authentication<br />Bitcointalk Username: Shiversnow<br />Rank: Full Member<br />Bitcoin Wallet Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd
I found "12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi" in only this (unedited) post:
Ingoats 1083582 50393708 <a href="https://bitcointalk.org/index.php#3">Economy</a> / <a href="https://bitcointalk.org/index.php?board=52.0">Services</a> / <b><a href="https://bitcointalk.org/index.php?topic=5126495.msg50393708#msg50393708">Re: [OPEN] [SIGNATURE CAMPAIGN] BLOCKCHAIN ZERO TO ONE </a></b>
#Proof of Authentication<br />Bitcointalk Username: Ingoats<br />Rank: Full Member<br />Bitcoin Wallet Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi<br />
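For anyone who wants to do a similar search on their own copy of saved posts, here is a rough Python sketch. It assumes the posts have been unpacked as individual .html files under one directory, which is not how the older data in this thread is stored (that sits in large compressed files). The function name and the throwaway demo files are made up for illustration.

```python
import tempfile
from pathlib import Path

def find_posts_containing(root: Path, needle: str) -> list[Path]:
    """Return every *.html file under root whose text contains needle."""
    hits = []
    for path in sorted(root.rglob("*.html")):
        # errors="replace" keeps the scan going if a file has odd bytes
        if needle in path.read_text(encoding="utf-8", errors="replace"):
            hits.append(path)
    return hits

# Tiny demonstration with throwaway files standing in for saved posts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "5039").mkdir()
    (root / "5039" / "50393313.html").write_text(
        "Bitcoin Wallet Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd",
        encoding="utf-8",
    )
    (root / "5039" / "50393708.html").write_text(
        "Bitcoin Wallet Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi",
        encoding="utf-8",
    )
    matches = find_posts_containing(root, "12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd")
    demo_names = [p.name for p in matches]
    print(demo_names)
```

Reading every file in full is slow on millions of posts, so in practice a tool like grep over the directory is probably the better choice; the sketch only shows the idea.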
|
|
|
|
|