LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
June 05, 2020, 07:24:27 PM |
|
Millions more posts added: I have now archived the first 35.5 million posts, all available online. This includes posts made in topics created until April 24, 2018 and currently fills 43 GB. Example: my first post! See this quote on how to use it:
Sneak preview: http://loyce.club/archive/oldposts/
How to use:
- Find the msgID you need. Let's use 28228
- Remove the last 5 digits from the msgID to get the directory name (if there are fewer than 5 digits, use 0): 0
- Replace the last 2 digits of the msgID by xx, and add .html (if there are fewer than 5 digits, use 0xx): 282xx.html
- Add "#msg" and the msgID: #msg28228
- Put everything together and go to http://loyce.club/archive/oldposts/0/282xx.html#msg28228
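The steps above can be sketched in a few lines of Python. This is only an illustration of the recipe, not the code used to build the archive: the function name oldpost_url is made up, and treating "nothing left after removing digits" as 0 is my reading of the "use 0" and "use 0xx" notes.

```python
BASE_URL = "http://loyce.club/archive/oldposts/"


def oldpost_url(msg_id: int) -> str:
    """Build the archive URL for a msgID (hypothetical helper)."""
    if msg_id < 0:
        raise ValueError("msgID must not be negative")
    # Removing the last 5 digits is integer division by 100000;
    # short msgIDs end up in directory 0.
    directory = msg_id // 100_000
    # Replacing the last 2 digits by "xx" keeps msg_id // 100 as the prefix;
    # msgIDs below 100 give "0xx.html" (assumption).
    page = f"{msg_id // 100}xx.html"
    return f"{BASE_URL}{directory}/{page}#msg{msg_id}"


print(oldpost_url(28228))
```

For msgID 28228 this gives the same link as the last step of the list.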
Limitations:
- Currently, the first 2.1 million posts are available.
- I'll scrape the first 5.21 million topics and all posts in there.
- That means I'll archive 53.36 million posts, this partially overlaps with my scraper for new posts.
- This is a one-time thing, I won't update it with newer posts (I scrape unedited versions for those).
- The time "scraped on" is Amsterdam time.
If no username is mentioned, it's either "Anonymous" or "random". I forgot those exist when I started scraping, and it's not important enough to start over.
This bug is not fixed yet:
I found a bug (which I'm posting here as a reminder to myself): Posts on the עברי (Hebrew) board don't show up. Example: this post is missing, while it exists. I'll see if I can add them later. I think it has something to do with the right-to-left writing, even selecting text on that board doesn't work as expected.
Update: عربية (Arabic) has the same problem. I'll re-scrape these boards after finishing scraping all posts.
Todo: When I have the time, I'll create something to classify all posts in a requested topic as "unedited", "deleted and archived", "edited within 10 minutes" or "edited after 10 minutes". But that will only be for one topic at a time, you can't easily check all posts.
Another Todo: I should create this per user, that could prove very useful. Deleting a post would make that post stand out more!
|
|
|
|
Vod
Legendary
Offline
Activity: 3836
Merit: 3130
Licking my boob since 1970
|
|
June 05, 2020, 09:48:44 PM |
|
There's no way for me to know which posts have been edited. I'd have to check all 50 million posts again.
There are only 100,000 active users. Don't need to recheck posts made by inactive or banned users.
|
I post for interest - not signature spam. https://vod.fan - fast/free image sharing - coming Oct! Will Theymos finish his $100,000,000 forum before this one shuts down?
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
June 06, 2020, 07:21:32 AM |
|
There are only 100,000 active users. Don't need to recheck posts made by inactive or banned users.
That could actually work, but it's still a lot of scraping to do. I won't do it: even if the average user has only 200 posts, that means scraping a million pages on a regular basis. With 5 seconds delay, it takes 2 months to find changed posts. Considering how much I scrape already, I don't think this is worth it.
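A rough check of that estimate, as a small Python sketch. The 20 posts per page is an assumption on my side (it is what makes 100,000 users with 200 posts each come out at a million pages); the other numbers are taken from the post.

```python
active_users = 100_000
posts_per_user = 200      # "even if the average user has only 200 posts"
posts_per_page = 20       # assumption: posts shown per scraped page
delay_seconds = 5         # delay between page requests

pages = active_users * posts_per_user // posts_per_page
total_seconds = pages * delay_seconds
days = total_seconds / 86_400  # seconds per day

print(f"{pages:,} pages, about {days:.0f} days per full pass")
```

With these inputs one full pass takes a bit under 58 days, which is roughly the 2 months mentioned.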
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
June 12, 2020, 11:40:48 AM |
|
Viewing unedited/deleted posts
Question: would it be useful to add links from all archived posts to the other categories (and the other way around)? I can quite easily update the "members" and "topics" category, but updating the "posts" will be more work.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
June 21, 2020, 07:31:26 AM |
|
|
|
|
|
TheQuin
|
|
July 25, 2020, 04:13:12 AM |
|
Thanks for providing this service. It saved me a lot of time rewriting a post in a thread that got trashed yesterday.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 25, 2020, 07:24:08 AM |
|
It saved me a lot of time rewriting a post in a thread that got trashed yesterday.
I even Merited that post. I did report the topic (it was on the wrong board and OP was begging), but it did have good posts. See https://loyce.club/archive/details/topic_5264143.html
You may have been able to recover your post from drafts too.
|
|
|
|
TheQuin
|
|
July 26, 2020, 05:48:34 AM |
|
(it was on the wrong board and OP was begging)
I didn't think about the begging angle. That explains why it was trashed rather than moved.
You may have been able to recover your post from drafts too.
I didn't realize they were saved for 7 days. Thanks for the tip.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 26, 2020, 10:49:59 AM |
|
I didn't realize they were saved for 7 days. Thanks for the tip.
A warning though: if you post/preview a lot, you'll reach 100 drafts within a day (and the older drafts are lost).
|
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 12, 2020, 08:53:33 AM Last edit: August 12, 2020, 09:08:38 AM by LoyceV Merited by hosseinimr93 (1) |
|
Update: I finished scraping all oldposts. The last post archived as "oldpost" is Post 53360099, which was created December 16, 2019 and archived July 9, 2020. I won't update this archive anymore.
I still want to add those posts to my "per topic" and "per member" lists, but I need to find some time for that.
This "oldposts" archive takes 72 GB on the server. My real-time posts take 14 GB. I don't really want to break existing hyperlinks to individual archived posts, but some day I may have to convert them into 100 posts per page too (that reduces disk usage a bit). I currently have over 3 million individual posts stored.
I've fixed this bug:
I found a bug (which I'm posting here as a reminder to myself): Posts on the עברי (Hebrew) board don't show up. Example: this post is missing, while it exists. I'll see if I can add them later. I think it has something to do with the right-to-left writing, even selecting text on that board doesn't work as expected.
Update: عربية (Arabic) has the same problem. I'll re-scrape these boards after finishing scraping all posts.
|
|
|
|
Aveatrex
|
|
August 20, 2020, 08:55:49 PM |
|
Hey @LoyceV, I'm currently making the Chrome extension and reached the part where I need to make an XMLHttpRequest to get the HTML of the original post from your website. I get an error:
It seems like you are blocking requests, is there a specific reason why? I'm able to bypass this by using a CORS proxy, but since it's your website I would like to ask you first.
Another question: even when bypassing this, it's unable to get the theymos.css file. For some reason it's trying to get it at "https://bitcointalk.org/theymos.css". Can you give me this CSS file so I can include it manually in the extension?
Edit: Didn't see suchmoon's post on the first page, was able to include the theymos.css file manually thanks to him.
|
|
|
|
TryNinja
Legendary
Offline
Activity: 2968
Merit: 7397
|
|
August 20, 2020, 08:59:33 PM |
|
It seems like you are blocking requests, is there a specific reason why? I'm able to bypass this by using a CORS proxy, but since it's your website I would like to ask you first.
It's just the default behavior. I asked for the same thing before and he included the forum in the allowed CORS websites (but just for the page I needed).
Another question, even by bypassing this, it's unable to get the theymos.css file for some reason it's trying to get it at "https://bitcointalk.org/theymos.css", can you give me this css file so I can include it manually in the extension?
https://loyce.club/archive/posts/theymos.css
It's loaded to his website with "../theymos.css", so it's using the relative path from the current website (bitcointalk). Just include it in the extension or replace the string to use the full path I posted above.
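To see why the same "../theymos.css" ends up at two different addresses, here is a small Python sketch using urllib.parse.urljoin, which resolves relative references roughly the way a browser does. The archive page URL is the one mentioned in this topic; the Bitcointalk page URL is just an example of a page the extension might run on.

```python
from urllib.parse import urljoin

RELATIVE_CSS = "../theymos.css"

# Page as served from the archive: the stylesheet resolves next to the posts.
archive_page = "https://loyce.club/archive/posts/5190/51902990.html"
css_on_archive = urljoin(archive_page, RELATIVE_CSS)

# Same HTML injected into a forum page (example URL): the base is now
# bitcointalk.org, and ".." above the root is simply dropped.
forum_page = "https://bitcointalk.org/index.php?topic=5102296"
css_on_forum = urljoin(forum_page, RELATIVE_CSS)

print(css_on_archive)
print(css_on_forum)
```

The first result is the full path posted above, the second is the address the extension was trying to load.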
|
|
|
|
Aveatrex
|
|
August 20, 2020, 09:20:31 PM |
|
It seems like you are blocking requests, is there a specific reason why? I'm able to bypass this by using a CORS proxy, but since it's your website I would like to ask you first.
It's just the default behavior. I asked for the same thing before and he included the forum in the allowed CORS websites (but just for the page I needed).
Then from my understanding LoyceV will have to add the entire forum host "https://bitcointalk.org/" to the allowed CORS websites for my extension to work. Hm, let's see what he says about that.
It's totally possible to bypass it by using a CORS proxy like https://cors-anywhere.herokuapp.com/: you make a request to "https://cors-anywhere.herokuapp.com/https://loyce.club/archive/posts/5190/51902990.html" instead of requesting "https://loyce.club/archive/posts/5190/51902990.html" directly, and it works like a charm.
Edit: Saw on your old post that it's the solution you are now using. What did you use to set up your proxy server?
It's loaded to his website with "../theymos.css", so it's using the relative path from the current website (bitcointalk). Just include it in the extension or replace the string to use the full path I posted above.
Yea that was the problem thanks, I included theymos.css directly and it works now.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 21, 2020, 09:46:00 AM |
|
Then from my understanding LoyceV will have to add the entire forum host "https://bitcointalk.org/" to the allowed CORS websites for my extension to work. Hm, let's see what he says about that.
This is what I said last time:
I have no idea what any of this means
I currently have this in apache2.conf:
# Code from suchmoon, added July 8, 2020. See https://bitcointalk.org/index.php?topic=5102296.msg54755930#msg54755930
<Files "latestversion.txt">
Header set Access-Control-Allow-Origin "https://bitcointalk.org"
</Files>
Just tell me what to change
Will this work (I haven't changed it yet)?
<Directory /var/www/>
Header set Access-Control-Allow-Origin "https://bitcointalk.org"
</Directory>
For what it's worth: I hadn't figured out the right CSS yet for posts at that time. It gets better later
|
|
|
|
Aveatrex
|
|
August 21, 2020, 03:34:55 PM |
|
I currently have this in apache2.conf:
# Code from suchmoon, added July 8, 2020. See https://bitcointalk.org/index.php?topic=5102296.msg54755930#msg54755930
<Files "latestversion.txt">
Header set Access-Control-Allow-Origin "https://bitcointalk.org"
</Files>
From my understanding, this only allows requests to the latestversion.txt file. Since I'm doing requests to /archive/posts/*/* it doesn't work for me.
Will this work (I haven't changed it yet)?
<Directory /var/www/>
Header set Access-Control-Allow-Origin "https://bitcointalk.org"
</Directory>
This *should* work because /var/www normally contains all hosted files, so the header would be sent for all of them and requests coming from bitcointalk.org would be allowed. Let's give it a try
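For reference, this is roughly the check a browser makes on the response before handing it to a script running on another site. It is a simplified Python sketch (no credentials, no preflight), and cors_allows is a made-up name, not part of any library.

```python
from typing import Mapping


def cors_allows(response_headers: Mapping[str, str], origin: str) -> bool:
    """Return True if the response may be read by a script from `origin`."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    # The header must be "*" or match the requesting origin exactly.
    return allowed in ("*", origin)


# Header as set by the <Directory /var/www/> block:
with_header = {"Access-Control-Allow-Origin": "https://bitcointalk.org"}
print(cors_allows(with_header, "https://bitcointalk.org"))  # True
print(cors_allows({}, "https://bitcointalk.org"))           # False
```

Note that the origin is compared without a trailing slash, the same way it is written in the Header line.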
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 21, 2020, 04:55:43 PM |
|
Let's give it a try
Done, test it please.
|
|
|
|
Aveatrex
|
Let's give it a try
Done, test it please.
Works! Thanks. Here's a quick sneak peek:
Another question (sorry): sometimes I get a 404 error when requesting, for example, this post. My best guess is that posts prior to a certain date haven't been archived, is that it?
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3444
Merit: 17471
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 21, 2020, 05:58:05 PM |
|
sometimes I get a 404 error when requesting, for example, this post. My best guess is that posts prior to a certain date haven't been archived, is that it?
You may want to read my OP. That post can be found here. It was scraped a lot later though, so it's not "unedited".
|
|
|
|
Aveatrex
|
|
August 21, 2020, 06:13:36 PM |
|
You may want to read my OP. That post can be found here. It was scraped a lot later though, so it's not "unedited".
I see. I'll just make it display "Original post not available" if I get a 404 error from your website.
Thank you for taking the time to respond to my questions!
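The 404 fallback could look something like this. It is a Python sketch of the logic only (the extension itself would do this in JavaScript), and fetch_original_post and its opener parameter are names I made up so the function can be tried without touching the network.

```python
import urllib.error
import urllib.request

NOT_AVAILABLE = "Original post not available"


def fetch_original_post(url: str, opener=urllib.request.urlopen) -> str:
    """Return the archived HTML, or a placeholder if the archive has no copy."""
    try:
        with opener(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # The post was never archived as "unedited".
            return NOT_AVAILABLE
        # Anything else (e.g. 500) is a real problem, so let it surface.
        raise
```

Only a 404 is turned into the placeholder text; other errors are passed on so they don't get mistaken for a missing post.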
|
|
|
|
|