Bitcoin Forum
May 12, 2024, 09:12:44 AM *
Author Topic: 60M posts! View unedited/deleted posts (search per post, per user or per topic)  (Read 8649 times)
LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
February 22, 2020, 11:08:03 AM
 #81

If it is not a secret, how much disk space is needed for all those millions of posts?
I'm currently using 54 GB for loyce.club, and store 4.2 million files.

Quote
And is there a way to use some compression?
I mainly store HTML files, so indeed, it would be great if a web browser could just use index.html.gz to largely reduce the disk space consumption, but I just tested it and my browser doesn't get it.
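To get a feel for how much gzip might save on HTML like this, here is a minimal sketch. The sample markup is made up for illustration and is not taken from the archive; real savings depend on the actual files.

```python
import gzip

def gzip_savings(html: str) -> tuple[int, int]:
    """Return (raw_bytes, gzipped_bytes) for one HTML document."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw))

# Hypothetical sample: repetitive markup, similar in spirit to a saved forum post.
sample = "<div class='post'>" + "<br />quoted text " * 200 + "</div>"
raw_size, gz_size = gzip_savings(sample)
print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes")
```

Note that a browser normally only decompresses gzip transparently when the server sends the file with a `Content-Encoding: gzip` header; opening a bare `.html.gz` file does not trigger that. Whether the host used here can be configured that way is not stated in the thread.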



Due to my lack of time it took longer than I wanted, but I now added live updates for posts per user and per topic:
Viewing unedited/deleted posts

How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if the msgID has 4 digits or fewer, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
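The steps above can be sketched in a few lines of Python. The function name `archive_url` and the `kind` parameter are assumptions made for illustration; only the "posts" and "topics" directories appear in this thread, and the directory name for user pages is not given, so it is left to the caller.

```python
BASE_URL = "http://loyce.club/archive"  # base URL taken from the post above

def archive_url(kind: str, item_id: int) -> str:
    """Build the archive URL for an ID (hypothetical helper, not the author's code).

    kind is the sub-directory name, e.g. "posts" or "topics".
    """
    # Dropping the last 4 digits is the same as integer division by 10,000,
    # so IDs below 10,000 land in directory 0.
    directory = item_id // 10_000
    return f"{BASE_URL}/{kind}/{directory}/{item_id}.html"

print(archive_url("posts", 51902990))
# http://loyce.club/archive/posts/5190/51902990.html
```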

Details
  • Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML-files. Use CTRL-F to find what you're looking for.
  • I don't scrape hidden boards (such as Investigations).
  • I don't keep post titles.
  • I save raw HTML, including quotes.
  • If I run out of disk space, I might create compressed archives per 10,000 posts.
  • Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
  • My current (sponsored) webhost has enough storage space for years to come.
  • All scrape-times use Amsterdam time (CET).
  • Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.
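As a rough idea of what "compressed archives per 10,000 posts" could look like, here is a sketch that packs one directory of HTML files into a single .tar.gz. The function name, the paths and the choice of tar with gzip are all assumptions; the post does not say which format would be used.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_directory(directory: Path, dest_dir: Path) -> Path:
    """Pack the HTML files of one directory into <name>.tar.gz and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / f"{directory.name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # Sorted for a stable order; only the HTML files are packed.
        for html_file in sorted(directory.glob("*.html")):
            tar.add(html_file, arcname=f"{directory.name}/{html_file.name}")
    return archive_path

# Small demonstration in a temporary directory (file names are examples only).
with tempfile.TemporaryDirectory() as tmp:
    posts = Path(tmp) / "5190"
    posts.mkdir()
    (posts / "51902990.html").write_text("<div>example post</div>", encoding="utf-8")
    packed = archive_directory(posts, Path(tmp) / "packed")
    with tarfile.open(packed) as tar:
        print(tar.getnames())
```

The trade-off is that individual posts are then no longer reachable by a plain URL; they have to be extracted first.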

Examples

~DefaultTrust
Copper Member
Sr. Member
****
Offline Offline

Activity: 1554
Merit: 489

Stop the war!


View Profile
February 22, 2020, 01:22:26 PM
 #82

I now have the first 6.1 million Bitcointalk posts archived. Data processing took longer than expected, but it's published now. See link above.

38 million left.
How fast does your parsing work?

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
February 22, 2020, 01:52:01 PM
 #83

How fast does your parsing work?
See:
I expect to complete this around August.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 02, 2020, 05:16:29 PM
 #84

How to use it
  • Find the msgID, userID or topicID you need. Let's use msgID 51902990.
  • Remove the last 4 digits from the msgID to get the directory name (if the msgID has 4 digits or fewer, use 0): 5190.
  • Put everything together behind the (above) URL and add ".html": http://loyce.club/archive/posts/5190/51902990.html.
This is an example of how I use it in practice: I copy the topicID (5229466), then type "topics" in my URL bar. My browser suggests http://loyce.club/archive/topics/, which I select. Then, I paste the topicID, hit Backspace 4 times, type "/", paste the topicID again, type ".html" and hit Enter. It takes some getting used to, but in just 5 seconds I have the page I was looking for: http://loyce.club/archive/topics/522/5229466.html.

alani123
Legendary
*
Offline Offline

Activity: 2394
Merit: 1419


Leading Crypto Sports Betting & Casino Platform


View Profile
March 05, 2020, 02:10:01 PM
Merited by logfiles (1)
 #85

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 05, 2020, 05:20:16 PM
 #86

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

hosseinimr93
Legendary
*
Offline Offline

Activity: 2394
Merit: 5237



View Profile
March 05, 2020, 10:00:50 PM
Merited by alani123 (2)
 #87

I see here that a thread was captured with just the error message for wrong BBcode:
http://loyce.club/archive/posts/5395/53951381.html

Any way to include the content in spite of wrong BBcode?
I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.

Yes, that's an error (maybe a bug) from the forum.

That archived post was like the following post.

https://bitcointalk.org/index.php?topic=4337249.0

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"

alani123
Legendary
*
Offline Offline

Activity: 2394
Merit: 1419


Leading Crypto Sports Betting & Casino Platform


View Profile
March 05, 2020, 10:36:30 PM
 #88

I don't do BBCode, I only read HTML. This error came from the forum, not from me. I just save the post as it was made before editing.
I should have guessed this. It must be the easiest way to scrape anyway...

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.

The contents of posts that are otherwise invisible because they include a table with broken tags are accessible to any forum member able to quote the post, but invisible to robots. I don't see any reason for a poster to do this to their posts intentionally. If they can edit their threads, the contents could simply be replaced with something like a dot and be done with it.

But it could be that a few thousand such posts exist. Google gives 3100 results when you search for ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org), some of them duplicates and some coming from signatures, of course.
2550 results if you exclude two users that came up with broken signatures ("INVALID BBCODE: close of unopened tag in table" site:bitcointalk.org -Gamesbuy -trinaldao)

Now, I'm stepping into territory of a sub-case in a sub-case, but if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing? Probably not. But strictly speaking it should be done if you'd want to grab everything that's available.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 06, 2020, 07:49:56 AM
 #89

https://bitcointalk.org/index.php?topic=4337249.0

Quote the post and then click on preview. You will see that the post is shown correctly. But it doesn't work when it is posted. It says "INVALID BBCODE: close of unopened tag in table (1)"
Maybe it hits the 64 kB limit in HTML; the Russian characters take a lot more space that way. I'm not sure that's the limit, though: I've made posts that take 80 kB when scraped.
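To illustrate why Cyrillic text can grow so much in HTML, here is a small sketch comparing the same text as UTF-8 and as numeric character references such as `&#1055;`. That the forum stores or serves posts this way is an assumption made for the example; the post above only says the characters take more space in HTML.

```python
def size_comparison(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, entity_bytes) for the same text."""
    utf8_bytes = len(text.encode("utf-8"))
    # xmlcharrefreplace turns every non-ASCII character into &#NNNN;
    entity_bytes = len(text.encode("ascii", errors="xmlcharrefreplace"))
    return utf8_bytes, entity_bytes

print(size_comparison("Привет"))  # (12, 42)
print(size_comparison("Hello"))   # (5, 5)
```

In this encoding each Cyrillic letter costs 7 bytes instead of 2, so a long Russian post reaches any fixed size limit much sooner than an English one of the same length.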

Interesting how you can quote the post to see contents. I know this is a scenario that's too specific but I'll post my two cents anyway.
There are more bugs in SMF that cause the preview to show differently than the real post.

Quote
if posts with broken bbcode are still unedited and quotable, then their contents could be salvaged. Could that be worth pursuing?
I only want to archive what the forum shows as public information.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 08, 2020, 10:17:54 AM
 #90

Update: I now have the first 11 million posts scraped! At the moment, the first 6.1 million are available on loyce.club; processing all data takes approximately 3 days. At the current rate, I'm on schedule to complete archiving all posts around August.
I've been thinking about expanding my archived posts to all posts that haven't been deleted yet.
An update: I have started this project! Measured in scraping time, it's the biggest project I ever started. In the past 9 days, I've scraped about 4% of all data, so I expect to complete this around August.
There's also a chance I'll run out of disk space because of the millions of large posts made by bounty spammers, but I'll deal with that when it happens.

Sneak preview: http://loyce.club/archive/oldposts/
How to use:
  • Find the msgID you need. Let's use 28228
  • Remove the last 5 digits from the msgID to get the directory name (if the msgID has 5 digits or fewer, use 0): 0
  • Replace the last 2 digits of the msgID with xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
  • Add "#msg" and the msgID: #msg28228
  • Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
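The same recipe as a short Python sketch. The function name `oldpost_url` is made up for illustration, and the handling of very short msgIDs (the "0xx" case) is my reading of the steps above rather than something confirmed in the post.

```python
OLDPOSTS_URL = "http://loyce.club/archive/oldposts"  # from the post above

def oldpost_url(msg_id: int) -> str:
    """Build the oldposts URL for a msgID (hypothetical helper)."""
    directory = msg_id // 100_000     # drop the last 5 digits; short IDs give 0
    page = f"{msg_id // 100}xx.html"  # replace the last 2 digits with "xx"
    return f"{OLDPOSTS_URL}/{directory}/{page}#msg{msg_id}"

print(oldpost_url(28228))
# http://loyce.club/archive/oldposts/0/282xx.html#msg28228
```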

Limitations
  • Currently, the first 2.1 million posts are available.
  • I'll scrape the first 5.21 million topics and all posts in there.
  • That means I'll archive 53.36 million posts. This partially overlaps with my scraper for new posts.
  • This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
  • The time "scraped on" is Amsterdam time.

If no username is mentioned, it's either "Anonymous" or "random". I forgot those exist when I started scraping, and it's not important enough to start over.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 14, 2020, 07:38:05 AM
 #91

Update: I now have the first 11 million posts scraped!
It took longer than expected (real life has been very busy lately), but it's done: I now have posts up to April 6, 2015 archived.

Out of these 11 million first posts, 1,520,880 (13.8%) are Deleted or Off-limits (most likely deleted).

nutildah
Legendary
*
Offline Offline

Activity: 2982
Merit: 7986



View Profile WWW
March 24, 2020, 05:03:39 AM
 #92

Update: I now have the first 11 million posts scraped!
It took longer than expected (real life has been very busy lately), but it's done: I now have posts up to April 6, 2015 archived.

Out of these 11 million first posts, 1,520,880 (13.8%) are Deleted or Off-limits (most likely deleted).

Hi Loyce, by any chance did you get around to archiving posts after April 2015 and up to where you started in July 2019? I'm looking for a post from July 2018. Thanks.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 24, 2020, 07:20:47 AM
 #93

I'm looking for a post from July 2018.
I started scraping 2 months later (I haven't published those posts online). My older post scraping project scrapes one thread at a time, so if the post was made in an old topic and only deleted recently, I might have it. But that's not very likely.

nutildah
Legendary
*
Offline Offline

Activity: 2982
Merit: 7986



View Profile WWW
March 24, 2020, 07:25:41 AM
 #94

I'm looking for a post from July 2018.
I started scraping 2 months later (I haven't published those posts online). My older post scraping project scrapes one thread at a time, so if the post was made in an old topic and only deleted recently, I might have it. But that's not very likely.

That's alright, I managed to find a copy of it elsewhere. Thanks for the info.

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
March 25, 2020, 10:10:39 AM
Last edit: March 26, 2020, 12:36:18 PM by LoyceV
 #95

See http://loyce.club/archive/topics/ for posts made in a certain topic (Working!)
Updated every 5 minutes.
Notification: On many of those pages, the topic title was missing (due to an error). I've temporarily renamed the current version to http://loyce.club/archive/topics.old____fixing_errors_in_some_of_the_titles/. If you're looking for this data, use this, but please don't post links to that URL. Once the update is done, I'll remove this link.
The normal location, http://loyce.club/archive/topics/, has incomplete data at the moment.

Update: done! The normal link works again :)

LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
April 01, 2020, 08:14:00 AM
Last edit: April 01, 2020, 08:52:38 AM by LoyceV
 #96

Update: I now have the first 11 million posts scraped!
Some of today's weird usernames will most likely end up in my files forever! I scrape topics, and get the username from the post, not from the profile. Oh well.......
Update: Usernames aren't affected without logging in. I'm quite happy with that, as most of my scraping doesn't use an account.

cheater detector
Member
**
Offline Offline

Activity: 213
Merit: 53


View Profile
April 07, 2020, 10:18:04 AM
 #97

Hi LoyceV, have you archived this topic? https://bitcointalk.org/index.php?topic=5126495.0
You can check if you are not busy
Thank you very much
LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
April 07, 2020, 11:02:45 AM
 #98

Hi LoyceV, have you archived this topic? https://bitcointalk.org/index.php?topic=5126495.0
Probably, it's part of this:
I also have older posts: I've saved (most) unedited posts (6.2 million posts) since September 12, 2018, until the start of this topic. This data has not been added to this topic, and I can't really add it because I tried to remove quotes and that has some bugs. You can request to dig up unedited data when needed.
But that's currently stored in large compressed files. Can you tell me what exactly you're looking for? I can probably dig up all posts made in that topic (without quotes and the above mentioned bugs), but it might be easier if you tell me what you're looking for.

cheater detector
Member
**
Offline Offline

Activity: 213
Merit: 53


View Profile
April 07, 2020, 11:16:41 AM
 #99

1. Shiversnow
BTC Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd



2. Ingoats
BTC Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi



I only need these two posts, to make sure they applied to that campaign.
I found them with Google search, but I couldn't find them on the forum.
LoyceV (OP)
Legendary
*
Offline Offline

Activity: 3304
Merit: 16644


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
April 07, 2020, 04:26:48 PM
 #100

1. Shiversnow
BTC Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd



2. Ingoats
BTC Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi



I only need these two posts, to make sure they applied to that campaign.
I found them with Google search, but I couldn't find them on the forum.
I searched starting at post 50385301 (March 30, 2019) until post 50844598 (May 1, 2019). Out of those 459298 posts, I have saved 431578. I must have experienced some downtime.
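The numbers in that search can be checked with a few lines of arithmetic. This is only a worked example using the figures quoted above; the variable names are mine.

```python
first_id, last_id = 50385301, 50844598  # msgID range searched, from the post above
saved = 431578                          # posts found in the saved data

total = last_id - first_id + 1          # both endpoints are included
missing = total - saved
coverage = saved / total

print(total, missing, f"{coverage:.2%}")
# 459298 27720 93.96%
```

So roughly 6% of the posts in that range were not captured, which is well below the usual capture rate and fits the guess about downtime.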

I found "12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd" in only this (unedited) post:
Code:
Shiversnow
972438
50393313
<a href="https://bitcointalk.org/index.php#3">Economy</a> / <a href="https://bitcointalk.org/index.php?board=52.0">Services</a> / <b><a href="https://bitcointalk.org/index.php?topic=5126495.msg50393313#msg50393313">Re: [OPEN] [SIGNATURE CAMPAIGN] BLOCKCHAIN ZERO TO ONE </a></b>

#Proof of Authentication<br />Bitcointalk Username: Shiversnow<br />Rank: Full Member<br />Bitcoin Wallet Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd

I found "12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi" in only this (unedited) post:
Code:
Ingoats
1083582
50393708
<a href="https://bitcointalk.org/index.php#3">Economy</a> / <a href="https://bitcointalk.org/index.php?board=52.0">Services</a> / <b><a href="https://bitcointalk.org/index.php?topic=5126495.msg50393708#msg50393708">Re: [OPEN] [SIGNATURE CAMPAIGN] BLOCKCHAIN ZERO TO ONE </a></b>

#Proof of Authentication<br />Bitcointalk Username: Ingoats<br />Rank: Full Member<br />Bitcoin Wallet Address: 12iAFdKUFf2BincSosJt3Ns2x1xSS3okFi<br />
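The post does not say how this search was done, and the older data is described as stored in large compressed files. As a minimal sketch only, assuming the saved posts are unpacked as individual HTML files under one directory, a search for a string such as a Bitcoin address could look like this. The function name and the directory layout are hypothetical.

```python
import tempfile
from pathlib import Path

def find_posts_containing(root: Path, needle: str) -> list[Path]:
    """Return the saved HTML files under root whose text contains needle."""
    matches = []
    for html_file in sorted(root.rglob("*.html")):
        # errors="replace" keeps the scan going if a file has odd bytes.
        text = html_file.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            matches.append(html_file)
    return matches

# Small demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "5039"
    folder.mkdir()
    (folder / "50393313.html").write_text(
        "Bitcoin Wallet Address: 12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd", encoding="utf-8"
    )
    (folder / "50393314.html").write_text("unrelated post", encoding="utf-8")
    hits = find_posts_containing(Path(tmp), "12ujAKqXCwxFipZ6a8xdpXAo7EoitSGwMd")
    print([hit.name for hit in hits])
```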
