LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 21, 2019, 05:08:59 PM Last edit: February 21, 2023, 09:23:33 AM by LoyceV Merited by 1miau (10), DdmrDdmr (5), OmegaStarScream (3), Ucy (3), Halab (2), bitmover (2), DireWolfM14 (2), Rikafip (2), TMAN (2), vapourminer (1), JayJuanGee (1), nutildah (1), TheQuin (1), Coin-1 (1), sujonali1819 (1), The Cryptovator (1), hd49728 (1), wildan88 (1), dragonvslinux (1), Steamtyme (1), FontSeli (1), lulucrypto (1), Rrita (1), 0x256 (1), kaggie (1) |
|
February 22, 2020: All updates are now live!
August 12, 2020: I finished scraping all oldposts!
Ever wanted to see who's lying when a post has been edited or deleted? I may be able to help! I archive most posts within seconds after they are created (before any edits). I started this data collection around the time I started this topic. All data I have since then is available online.
I also have older posts: I've saved (most) unedited posts (6.2 million posts) from September 12, 2018, until the start of this topic. This data has not been added to this topic, and I can't really add it because I tried to remove quotes and that has some bugs. You can ask me to dig up unedited data when needed.
Viewing unedited/deleted posts
How to use it
Just click one of the links, and enter the msgID, userID or topicID. Or (this older method still works):
- Find the msgID, userID or topicID you need. Let's use msgID 51902990.
- Remove the last 4 digits from the msgID to get the directory name (if there are 4 or fewer digits, use 0): 5190.
- Put everything together behind the (above) URL and add ".html": https://loyce.club/archive/posts/5190/51902990.html.
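The manual steps above can be sketched in a few lines of Python. This is only an illustration of the URL scheme described in this post, not the code the archive itself runs; the function name `archive_post_url` is made up for the example.

```python
def archive_post_url(msg_id: int) -> str:
    """Build the archive URL for an unedited post from its msgID."""
    # Removing the last 4 digits is the same as integer division by 10,000.
    # msgIDs with 4 or fewer digits end up in directory 0.
    directory = msg_id // 10_000
    return f"https://loyce.club/archive/posts/{directory}/{msg_id}.html"


print(archive_post_url(51902990))
# prints https://loyce.club/archive/posts/5190/51902990.html
```

Integer division handles the "4 or fewer digits" case without a separate check.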
Details
- Files are stored with their msgID, userID or topicID as file name. I remove the last 4 digits to create the directory name. Each directory contains up to 10,000 HTML-files. Use CTRL-F to find what you're looking for.
- I don't scrape hidden boards (such as Investigations).
- I don't keep post titles
- I save raw HTML, including quotes
- If I run out of disk space, I might create compressed archives per 10,000 posts.
- Although I plan to preserve all data, I make no guarantees. Feel free to archive posts.
- My current (sponsored) webhost has enough storage space for years to come.
- All scrape-times use Amsterdam time (CET).
- Usually, I capture at least 99.95% of all posts. Server or internet connection problems can severely reduce this.
Examples
Older posts
Sneak preview: https://loyce.club/archive/oldposts/
How to use:
- Find the msgID you need. Let's use 28228
- Remove the last 5 digits from the msgID to get the directory name (if there are 5 or fewer digits, use 0): 0
- Replace the last 2 digits of the msgID by xx, and add .html (if there are 2 or fewer digits, use 0xx): 282xx.html
- Add "#msg" and the msgID: #msg28228
- Put everything together and go to https://loyce.club/archive/oldposts/0/282xx.html#msg28228
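The same steps as a short Python sketch. The function name `oldpost_url` is made up for the example, and the handling of msgIDs with only 1 or 2 digits (file name `0xx.html`) is an assumption based on the "use 0xx" remark above; it has not been checked against the archive.

```python
def oldpost_url(msg_id: int) -> str:
    """Build the oldposts archive URL for a msgID."""
    # Directory: drop the last 5 digits (0 for msgIDs with 5 or fewer digits).
    directory = msg_id // 100_000
    # File: replace the last 2 digits by "xx". For 1- or 2-digit msgIDs this
    # gives "0xx.html", which is an assumption, see the note above.
    page = f"{msg_id // 100}xx.html"
    # Anchor: jump straight to the post inside the 100-post page.
    return f"https://loyce.club/archive/oldposts/{directory}/{page}#msg{msg_id}"


print(oldpost_url(28228))
# prints https://loyce.club/archive/oldposts/0/282xx.html#msg28228
```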
Limitations
- Currently, the first 6.1 million posts are available.
- I'll scrape the first 5.21 million topics and all posts in there.
- That means I'll archive 53.36 million posts; this partially overlaps with my scraper for new posts.
- This is a one-time thing: I won't update it with newer posts (I scrape unedited versions for those).
- The time "scraped on" is Amsterdam time.
If no username is mentioned, it's either "Anonymous" or "random". I forgot those exist when I started scraping, and it's not important enough to start over. If anything goes wrong, let me know here.
See [overview] LoyceV's useful data on Bitcointalk for more of my forum-related topics
|
|
|
|
suchmoon
Legendary
Offline
Activity: 3794
Merit: 9018
https://bpip.org
|
quotes are not very clear
If you'd like to fix that: wrap the HTML in <div class="post">...</div> and use the following CSS:
.post { color: #000000; background-color: #ECEDF3; font-size: 12px; font-family: verdana, sans-serif; margin-bottom: 5px; padding: 5px; }
.post .quoteheader { color: #476C8E; text-decoration: none; font-style: normal; font-weight: bold; font-size: 10px; line-height: 1.2em; margin-left: 6px; }
.post .quote { color: #000000; background-color: #f1f2f4; border: 1px solid #d0d0e0; padding: 5px; margin: 1px 3px 6px 6px; font-size: 11px; line-height: 1.4em; }
It's by no means complete (still has problems with code tags etc) but should help with the quotes and makes it look similar to Bitcointalk styling. You can save it as a .css file and just reference in each html so space usage would be minimal and then you can adjust the CSS as needed.
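A rough sketch of what that wrapping could look like when the archive pages are generated. The function name `wrap_post` and the stylesheet name `post.css` are placeholders for the example, not what the archive actually uses.

```python
def wrap_post(post_html: str, css_href: str = "post.css") -> str:
    """Wrap a saved post fragment in a page that links one shared stylesheet."""
    # Every page references the same .css file, so the styling costs almost
    # no extra disk space and can be adjusted later in one place.
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f'<link rel="stylesheet" href="{css_href}">\n'
        "</head>\n<body>\n"
        f'<div class="post">{post_html}</div>\n'
        "</body>\n</html>\n"
    )


print(wrap_post('<div class="quote">quoted text</div>reply text'))
```

The `.post .quote` and `.post .quoteheader` rules above only apply when the saved HTML sits inside the `post` div, which is why the wrapper adds it.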
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 21, 2019, 06:37:54 PM |
|
[quote]If you'd like to fix that: wrap the HTML in <div class="post">...</div> and use the following CSS:[/quote]
Thanks!
The "div class post" part is there already, I never removed it. I'll make some more adjustments, I was lazy using some headers from the forum HTML, but I'll recreate them on my own. I've named it suchmoon.css
I'm creating a new post to test the new version. I'm also adding some code
Update: see http://loyce.club/archive/posts/5190/51903915.html
|
|
|
|
suchmoon
Legendary
Offline
Activity: 3794
Merit: 9018
https://bpip.org
|
|
July 21, 2019, 06:40:59 PM |
|
[quote]The "div class post" part is there already, I never removed it. I'll make some more adjustments, I was lazy using some headers from the forum HTML, but I'll recreate them on my own. I've named it suchmoon.css[/quote]
Feel free to call it theymos.css because it's mostly stolen from here: https://bitcointalk.org/Themes/custom1/style.css
You can [s]steal[/s] borrow more stuff from the above file, like the .code class
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 21, 2019, 07:02:31 PM Last edit: July 21, 2019, 07:24:16 PM by LoyceV |
|
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 22, 2019, 07:14:16 PM Last edit: August 20, 2019, 05:47:31 PM by LoyceV |
|
|
|
|
|
Timelord2067
Legendary
Offline
Activity: 3794
Merit: 2235
💲🏎️💨🚓
|
|
July 23, 2019, 08:22:59 AM Last edit: May 16, 2023, 11:30:17 PM by Timelord2067 |
|
[quote author=LoyceV link=topic=5167469.msg51915669#msg51915669 date=1563822856] Bump! [quote author=LoyceV link=topic=4720640.msg51908047#msg51908047 date=1563775702] This morning, I checked http://loyce.club/archive/posts/members/?SD and it instantly revealed a spammer: http://loyce.club/archive/posts/members/1514722.html
It got me thinking: I can create a daily list of users (sorted by the number of posts they made in the past 24 hours). That would instantly highlight users who post a lot, and makes it easy to identify bump spammers. If anyone's interested to check it once in a while, I'll make it [/quote] [/quote]
I thought I'd see how significant the posts step down at the 8k/4k file size: the middle one had just one post while the first and third have 29 and eight respectively. I guess the posting list will ebb and surge during holidays and work days/week-ends.
Perhaps a known spammers' link be changed to another colour? (red/purple/orange etc for spammer/scammer nuked Flag etc)??
Where might we post our findings?
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 23, 2019, 08:31:44 AM |
|
[quote]I thought I'd see how significant the posts step down at the 8k/4k file size: the middle one had just one post while the first and third have 29 and eight respectively.[/quote]
The third one in your list has 8 posts. Those 8k/4k aren't real file sizes, I think the webserver shows the block size it uses on the file system. That means this isn't the best way to sort files. I'll add it to my TODO: create an index.html with more information.
[quote]Perhaps a known spammers' link be changed to another colour? (red/purple/orange etc for spammer/scammer nuked Flag etc)??[/quote]
I can strike out banned users (also on my TODO now), but since they're still posting, that won't be many users yet.
[quote]Where might we post our findings?[/quote]
I'm not sure, maybe a separate thread?
|
|
|
|
Timelord2067
Legendary
Offline
Activity: 3794
Merit: 2235
💲🏎️💨🚓
|
|
July 23, 2019, 08:43:14 AM |
|
Instead of the 8k/4k approx file size an actual post count? Also, (am making work for you now) a sort by file name / last posted etc?
[quote][quote]Where might we post our findings?[/quote]
I'm not sure, maybe a separate thread?[/quote]
Perhaps self moderated and a simple code
date: (time GMT)
name+uid
post count
post type: scam [] one line [] signature [] bump []
people can see when the posts were last reviewed so they aren't doubling up on work?
|
|
|
|
LoyceMobile
|
|
July 23, 2019, 10:17:08 PM Last edit: July 30, 2019, 05:22:19 AM by LoyceMobile |
|
Just a thought: if I get deleted posts from modlog, I can highlight them too.
Another idea: list posts for each topicID, so it's easier to find posts that have been deleted from a certain topic.
|
|
|
|
LoyceMobile
|
|
July 28, 2019, 08:36:31 PM |
|
The member directory got messed up, I can't access my VPS from here so just don't look at it for the coming week.....
|
|
|
|
nutildah
Legendary
Offline
Activity: 3108
Merit: 8380
Happy 10th Birthday to Dogeparty!
|
|
August 01, 2019, 01:29:41 PM |
|
So, just so that I understand what's going on here, you're saving the first version of every post made by everybody, ever? I don't quite get what you're doing I guess.
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 01, 2019, 01:39:26 PM Last edit: August 01, 2019, 02:15:24 PM by LoyceV |
|
[quote]you're saving the first version of every post made by everybody, ever?[/quote]
Correct. Update.
|
|
|
|
tranthidung
Legendary
Offline
Activity: 2394
Merit: 4230
Farewell o_e_l_e_o
|
|
August 01, 2019, 02:34:12 PM |
|
you're saving the first version of every post made by everybody, ever? Correct. Update. I think it should be the version of posts within 15 minutes (if I am remembering correctly) after published. It will be more matched with forum data. Only posts edited after 15 minutes will be shown with editing history and last editing time. are all my edits within the first 10 minutes also logged?
No, edits in the grace period are not logged. btw, is this still the same TradeFortress?
Probably. [New Feature] "Last edit" to be shown as text on mobile. FIXED! 10x Theymos:)
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 01, 2019, 02:41:14 PM |
|
[quote]I think it should be the version of posts within 15 minutes (if I am remembering correctly) after published. It will be more matched with forum data. Only posts edited after 15 minutes will be shown with editing history and last editing time.[/quote]
Posts can be edited for 10 minutes without showing (or even keeping!) an edit history. But I'm not aiming to match the forum, I'm aiming to show the unedited post. And I can only download posts from recent when they're new, searching for posts that are 10 minutes old will be more work.
|
|
|
|
nutildah
Legendary
Offline
Activity: 3108
Merit: 8380
Happy 10th Birthday to Dogeparty!
|
|
August 01, 2019, 02:45:07 PM |
|
Wow Loyce, you've really managed to outdo yourself this time.
Can we keep this on the downlow, I'm sure it could be quite a weapon, lol.
I'm just kidding, about the downlow part. If it's a weapon everyone should have equal access to it.
It's going to be a great utility for busting liars.
|
|
|
|
tranthidung
Legendary
Offline
Activity: 2394
Merit: 4230
Farewell o_e_l_e_o
|
|
August 01, 2019, 02:48:32 PM |
|
[quote][quote]I think it should be the version of posts within 15 minutes (if I am remembering correctly) after published. It will be more matched with forum data. Only posts edited after 15 minutes will be shown with editing history and last editing time.[/quote]
Posts can be edited for 10 minutes without showing (or even keeping!) an edit history. But I'm not aiming to match the forum, I'm aiming to show the unedited post. And I can only download posts from recent when they're new, searching for posts that are 10 minutes old will be more work.[/quote]
Ooops. I did not know that page. Checked it, and saw only ten pages available. Is it a fixed one (for all)? Or is it just a default page, and I can modify total displayed pages if I want?
|
|
|
|
LoyceV (OP)
Legendary
Offline
Activity: 3430
Merit: 17385
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
August 03, 2019, 08:50:43 AM |
|
[quote]I'm sure it could be quite a weapon, lol.[/quote]
If the truth can be used as a weapon against someone, he totally deserves it
[quote]It's going to be a great utility for busting liars.[/quote]
I'm not sure yet how to keep it long-term though, it currently grows at a rate of about 2 GB and a couple of hundred thousand files per month. I'll get a bigger hosting soon, but long-term, I'm looking at some serious hosting requirements. My first priority is moving more of my data to a VPS, and get a more permanent solution (the current VPS is paid per month).
[quote]Checked it, and saw only ten pages available. Is it a fixed one (for all)?[/quote]
I only download the first page, there's not really a need to download other pages, as long as I get the first one often enough.
[quote]Just a thought: if I get deleted posts from modlog, I can highlight them too.[/quote]
To answer my own suggestion: this won't work, modlog doesn't show which post was deleted.
|
|
|
|
nutildah
Legendary
Offline
Activity: 3108
Merit: 8380
Happy 10th Birthday to Dogeparty!
|
|
August 03, 2019, 08:58:57 AM |
|
[quote][quote]I'm sure it could be quite a weapon, lol.[/quote]
If the truth can be used as a weapon against someone, he totally deserves it[/quote]
Well said. Quotable LoyceV. Of course you are inadvertently insinuating that women never lie.
I'm thinking eventually it will be handy in trying to compare writing styles between users, in addition to the usual "but you originally said this" type situations. I know you could probably parse all the text from particular users from the forum itself, but in the format on your server it's easier for me to attempt such a thing. I encourage you to keep it up as long as you can.
|
|
|
|
nutildah
Legendary
Offline
Activity: 3108
Merit: 8380
Happy 10th Birthday to Dogeparty!
|
|
August 07, 2019, 07:34:49 AM |
|
Hmm... Website seems to be down, might want to take a look at it...
|
|
|
|
|