TryNinja
Legendary
Offline
Activity: 3024
Merit: 7443
Top Crypto Casino
|
|
July 18, 2024, 05:34:56 PM Merited by NotATether (3) |
|
The problem is the moment to index the new threads, without RSS we have to "Ping" each board to know if there is new content.
You don't have to do that. If you keep track of every new topic, all you need to do is add +1 to the last known ID and see if it exists. Some problems might arise, like a thread being deleted before you check it (meaning the last ID has changed), but there are ways of minimizing this. For my scraper I used a workaround: I check the URLs of the most recent posts to see if a thread has zero replies, which would imply the post is an OP, i.e. a new topic.
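A minimal sketch of that ID-increment check in Python, assuming a plain HTTP GET against the topic URL; the error phrase tested for is a guess at SMF's "missing topic" page and would need verifying against what the forum actually returns:

Code:
import requests  # third-party: pip install requests

def topic_exists(topic_id: int) -> bool:
    """Fetch the topic page and guess whether the topic exists."""
    url = f"https://bitcointalk.org/index.php?topic={topic_id}.0"
    r = requests.get(url, timeout=10)
    # Assumption: deleted/nonexistent topics return an error page
    # containing this phrase; adjust to the forum's real response.
    return r.status_code == 200 and "The topic or board you are looking for" not in r.text

last_known = 5503125
if topic_exists(last_known + 1):
    print("New topic:", last_known + 1)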
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 18, 2024, 05:51:20 PM |
|
If you keep track of every new topic, all you need to do is add +1 to the last known ID and see if it exists. Some problems might arise, like a thread being deleted before you check it (meaning the last ID has changed), but there are ways of minimizing this.
I could try checking the next sequence of 100 topic IDs or so for new posts, since it's extremely unlikely that they were all deleted.
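A sketch of that window probe, reusing the hypothetical topic_exists() helper from the snippet above; in practice a delay between requests is needed to stay under the forum's rate limits:

Code:
import time

def find_new_topics(last_known: int, window: int = 100) -> list[int]:
    """Probe the next `window` topic IDs. Deleted topics leave gaps,
    but it is extremely unlikely that an entire window was deleted."""
    found = []
    for tid in range(last_known + 1, last_known + 1 + window):
        if topic_exists(tid):
            found.append(tid)
        time.sleep(1)  # stay well under the forum's rate limit
    return found

print(find_new_topics(5503125))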
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 18, 2024, 07:59:36 PM |
|
That's a security risk though, since it would require me to also store my password in plain text.
Never store passwords in plain text. Use a secrets manager, like AWS Secrets Manager, to inject your password at runtime.
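For example, the password can be fetched at runtime with boto3; the secret name here is a placeholder, and this sketch assumes a plain string secret rather than a JSON key/value one:

Code:
import boto3  # third-party: pip install boto3

def get_forum_password(secret_id: str = "bitcointalk/scraper") -> str:
    """Fetch the scraper password from AWS Secrets Manager at runtime."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]  # use json.loads(...) for key/value secrets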
|
|
|
|
seoincorporation
Legendary
Offline
Activity: 3346
Merit: 3125
|
|
July 19, 2024, 03:46:24 AM |
|
I'm curious about the method you are using to get the data from each thread. There is a command on Linux called lynx; it is a web browser for the command line, and with it you can get the text of a website or its source code:

Code:
lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"
lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could then use tools like cut and grep to extract only the relevant data. Making the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread can have multiple pages makes it a challenge.
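The same kind of page dump can also be done without lynx; a rough Python equivalent of lynx --dump using only the standard library (the User-Agent and the tag stripping are simplistic placeholders):

Code:
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    """Crude tag stripper: collects only the text content of a page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

url = "https://bitcointalk.org/index.php?topic=5503125.0"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read().decode("utf-8", errors="replace")

parser = TextExtractor()
parser.feed(html)
print("".join(parser.chunks))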
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 19, 2024, 04:01:23 AM |
|
I'm curious about the method that you are using to get the data from each thread
You understand how we get the info - just not how we know what info to get. It depends on what data you want to process. If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area.
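A rough sketch of that polling loop; the regular expression for topic links is a guess at the front-page markup and would need checking against the real HTML:

Code:
import re
import time
from urllib.request import Request, urlopen

seen = set()
while True:
    req = Request("https://bitcointalk.org/index.php",
                  headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8", errors="replace")
    # Hypothetical pattern: "Newest Posts" links look like index.php?topic=<id>.msg<n>
    for topic_id in re.findall(r'index\.php\?topic=(\d+)\.msg\d+', html):
        if topic_id not in seen:
            seen.add(topic_id)
            print("New post in topic", topic_id)
    time.sleep(10)  # parse the front page every 10 seconds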
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3500
Merit: 17698
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 19, 2024, 05:56:46 AM |
|
For now, I am scraping topics from the forum using my bot. If it helps, I can give you a tar.gz copy of my data
Sure, you can send me a copy by PM.
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 19, 2024, 12:12:29 PM |
|
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.
I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this. It would also give me time to figure out the best setup for this.
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3500
Merit: 17698
Thick-Skinned Gang Leader and Golden Feather 2021
|
|
July 19, 2024, 02:45:52 PM Merited by NotATether (2) |
|
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read. Is that a compressed tarball?
Yes.
Because for my daily backups I usually tar my folders without compression to make it go many times faster.
I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.
I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.
By all means: use my data for this. Only a few posts are censored by me. Other than that, the file format is pretty much the same everywhere.
|
|
|
|
seoincorporation
Legendary
Offline
Activity: 3346
Merit: 3125
|
|
July 20, 2024, 01:32:42 PM |
|
I'm curious about the method that you are using to get the data from each thread
You understand how we get the info - just not how we know what info to get. It depends on what data you want to process. If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area.
Ok, then what would happen if a user edited his post after it was recorded in the search engine database? It would be impossible to find new information in edited posts, and I feel like that will be a problem for this project. We have seen a lot of threads where a post saying "Reserved" gets edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
|
|
|
|
TryNinja
Legendary
Offline
Activity: 3024
Merit: 7443
Top Crypto Casino
|
|
July 20, 2024, 02:25:56 PM |
|
We have seen a lot of threads where a post saying "Reserved" gets edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
There is no easy solution for that. It's impossible to keep track of 55 million posts.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 20, 2024, 05:03:19 PM Last edit: July 20, 2024, 05:15:47 PM by NotATether |
|
We have seen a lot of threads where a post saying "Reserved" gets edited in the future. So, here we have another challenge for the project: how to deal with edited posts?
This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed as well.

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users, out of the daily active users, who might actually make an edit; their user IDs should then be prioritized when scanning for edits and deletions. The number of new posts daily >>> the number of edited posts daily.

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
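A sketch of how that prioritization could be scored; the field names and weights here are entirely hypothetical, and the only point is that banned or long-inactive users drop out while recently active posters float to the top of the edit-scan queue:

Code:
from datetime import datetime, timedelta

INACTIVITY_THRESHOLD = timedelta(days=120)

def edit_scan_priority(user: dict, now: datetime) -> float:
    """Score how urgently this user's posts should be rescanned for
    edits; 0 means skip entirely. All fields are hypothetical."""
    if user["banned"] or now - user["last_post_at"] > INACTIVITY_THRESHOLD:
        return 0.0
    days_since_login = max((now - user["last_login_at"]).days, 1)
    # Frequent posters who logged in recently are the likeliest editors.
    return user["posts_per_day"] / days_since_login

now = datetime.now()
users = [
    {"name": "alice", "banned": False, "posts_per_day": 3.2,
     "last_post_at": now - timedelta(days=1),
     "last_login_at": now - timedelta(hours=2)},
    {"name": "bob", "banned": False, "posts_per_day": 0.1,
     "last_post_at": now - timedelta(days=200),
     "last_login_at": now - timedelta(days=200)},
]
queue = sorted(users, key=lambda u: edit_scan_priority(u, now), reverse=True)
print([u["name"] for u in queue])  # ['alice', 'bob'] - bob scores 0 and is skipped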
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 20, 2024, 05:45:13 PM |
|
This leaves a small subset of users, out of the daily active users, who might actually make an edit; their user IDs should then be prioritized when scanning for edits and deletions.
There is no way for you to track which posts I may edit. You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited. The number of seconds in a day is limited to about 4x my number of posts (86,400 seconds ≈ 4 × 22,000 posts), and I'm sure there are more than three other people posting.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 20, 2024, 05:56:48 PM |
|
There is no way for you to track which posts I may edit. You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited. The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but statistically I will only need to scan a few users at a time, because editing is infrequent compared to posting. It's not perfect, obviously - edited posts probably won't be indexed for several hours this way* - but it's the best I can think of.

* Once the initial download is finished
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 20, 2024, 06:11:28 PM |
|
That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but statistically I will only need to scan a few users at a time, because editing is infrequent compared to posting.
Manual processes will cause this project to fail - too much data. Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
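One way to sketch that record locking, assuming the work queue lives in a shared database table (SQLite here for brevity; the table layout is made up): each parser atomically claims a batch of unclaimed topic IDs before fetching them, so several instances on different IPs never scrape the same record twice.

Code:
import sqlite3

def claim_batch(db: sqlite3.Connection, worker: str, size: int = 50) -> list[int]:
    """Atomically mark up to `size` unclaimed topics as owned by `worker`."""
    with db:  # one transaction: two workers cannot claim the same rows
        rows = db.execute(
            "SELECT topic_id FROM queue WHERE claimed_by IS NULL LIMIT ?",
            (size,)).fetchall()
        ids = [r[0] for r in rows]
        db.executemany(
            "UPDATE queue SET claimed_by = ? "
            "WHERE topic_id = ? AND claimed_by IS NULL",
            [(worker, tid) for tid in ids])
    return ids

db = sqlite3.connect("queue.db")
db.execute("CREATE TABLE IF NOT EXISTS queue "
           "(topic_id INTEGER PRIMARY KEY, claimed_by TEXT)")
print(claim_batch(db, worker="parser-1"))

On a real multi-machine setup the same idea is usually written against PostgreSQL with SELECT ... FOR UPDATE SKIP LOCKED, which lets concurrent workers claim rows without contending on a whole-database lock.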
|
|
|
|
FatFork
Legendary
Offline
Activity: 1792
Merit: 2675
Crypto Swap Exchange
|
|
July 20, 2024, 06:53:43 PM |
|
It's not perfect, obviously - edited posts probably won't be indexed for several hours this way* - but it's the best I can think of.
I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...
|
|
|
|
NeuroticFish
Legendary
Offline
Activity: 3864
Merit: 6596
Looking for campaign manager? Contact icopress!
|
|
July 20, 2024, 07:10:16 PM |
|
Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users, out of the daily active users, who might actually make an edit; their user IDs should then be prioritized when scanning for edits and deletions.
Is there no option to ask Theymos to offer - in one way or another - the list of last edited posts (maybe after a given date)? I mean, the forum already knows them; it only has to expose them somehow.
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 20, 2024, 07:42:12 PM |
|
There's no option to ask Theymos to offer - in a way or another - the list of "last" edited posts (maybe after a date)? I mean that the forum knows them, it only has to offer them somehow.
The only reason he couldn't do this would be if the indexing slowed down the site. I can't imagine there is any native SMF code that searches on that key. :/
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 28, 2024, 06:18:15 AM |
|
My scraper was blocked by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.
I do, however, have LoyceV's archive (thanks, Loyce), but I am not sure whether it covers posts before 2018.
|
|
|
|
Vod
Legendary
Offline
Activity: 3892
Merit: 3166
Licking my boob since 1970
|
|
July 28, 2024, 06:19:15 AM |
|
My scraper was blocked by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.
I explained how to do this five posts ago.... ^^^
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 1792
Merit: 7389
Top Crypto Casino
|
|
July 28, 2024, 06:22:18 AM |
|
I explained how to do this five posts ago.... ^^^
Wow, you're fast. You mean this?
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.
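A minimal sketch of rotating requests through a pool of proxies while keeping one shared rate limit; the proxy addresses are placeholders from the TEST-NET range:

Code:
import itertools
import time

import requests  # third-party: pip install requests

PROXIES = itertools.cycle([
    "http://203.0.113.10:8080",  # placeholder proxies - substitute real ones
    "http://203.0.113.11:8080",
])

def fetch(url: str) -> str:
    """Fetch a page through the next proxy in the pool."""
    proxy = next(PROXIES)
    r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
    r.raise_for_status()
    time.sleep(1)  # keep the same overall rate limit regardless of proxy count
    return r.text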
|
|
|
|
|