Bitcoin Forum
August 07, 2024, 05:21:31 AM *
Pages: « 1 [2] 3 »  All
Author Topic: Bitcointalk Search Project  (Read 453 times)
TryNinja
Legendary
*
Offline Offline

Activity: 2912
Merit: 7350


Top Crypto Casino


View Profile WWW
July 18, 2024, 05:34:56 PM
Merited by NotATether (3)
 #21

The problem is knowing when to index new threads: without RSS, we have to "ping" each board to find out whether there is new content.
You don't have to do that.

If you keep track of every new topic, all you need to do is add 1 to the latest ID and see if that topic exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

For my scraper I used more of a workaround: checking the URLs of the most recent posts to see whether their thread might have zero replies, which would imply the post is an OP.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 18, 2024, 05:51:20 PM
Merited by seoincorporation (1)
 #22

If you keep track of every new topic, all you need to do is add 1 to the latest ID and see if that topic exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

I could try checking the next 100 IDs or so for new posts, since it's extremely unlikely that they were all deleted.
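
As a rough illustration of the look-ahead idea, here is a minimal sketch. The `exists` callable is a stand-in for whatever request the scraper makes to test an ID (it is hypothetical, not part of any forum API), and the window of 100 is just the figure mentioned above.

```python
from typing import Callable, List, Tuple


def scan_ahead(
    last_id: int, exists: Callable[[int], bool], window: int = 100
) -> Tuple[List[int], int]:
    """Check the next `window` IDs after last_id.

    Returns the IDs that exist and the new high-water mark. The mark only
    advances to the highest ID actually seen, so gaps left by deleted
    items do not stall the scan or push it past unseen content.
    """
    found = [i for i in range(last_id + 1, last_id + 1 + window) if exists(i)]
    new_last = found[-1] if found else last_id
    return found, new_last
```

Calling it again with the returned mark resumes from the last ID that was really observed, so a handful of deleted IDs in between costs only a few wasted requests.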
Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 18, 2024, 07:59:36 PM
Merited by seoincorporation (1)
 #23

That's a security risk though, since it would require me to also store my password in plain text.

Never store passwords in plain text.  Use a secrets manager, such as AWS Secrets Manager, to supply your password at runtime.
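
A minimal sketch of keeping the password out of the code and config files. It does not use any particular secrets manager; it only assumes the secret is injected as an environment variable (the name `FORUM_PASSWORD` is made up for this example) and falls back to an interactive prompt.

```python
import getpass
import os


def load_password(env_var: str = "FORUM_PASSWORD") -> str:
    """Return the password from the environment, or prompt for it.

    The variable name is an assumption for this sketch. Nothing is
    written to disk, so the secret only exists in process memory.
    """
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass("Forum password: ")
```

A secrets manager would replace the environment lookup with a call to its own client, but the scraper code around it can stay the same.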

https://nastyscam.com - featuring 13 years of OGNasty bitcoin scams     https://vod.fan - advanced image hosting - coming sooner than you think!
seoincorporation
Legendary
*
Offline Offline

Activity: 3234
Merit: 3028



View Profile
July 19, 2024, 03:46:24 AM
 #24

I'm curious about the method that you are using to get the data from each thread. There is a Linux command called lynx, a web browser for the command line, and with it you can get either the text of a page or its source code:

Code:
lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"

Code:
lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could use tools like cut and grep to keep only the relevant data. Writing the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread can have multiple pages makes it a challenge.
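
To illustrate the multi-page point, here is a small sketch that lists the page URLs for one topic. It assumes the number after the dot in `topic=5503125.0` is a post offset and that a page holds 20 posts; both are assumptions about the forum's URL scheme and should be checked against real pages before relying on them.

```python
from typing import Iterator

BASE = "https://bitcointalk.org/index.php?topic={topic_id}.{offset}"
POSTS_PER_PAGE = 20  # assumption: page size seen by the scraper


def topic_page_urls(topic_id: int, reply_count: int) -> Iterator[str]:
    """Yield one URL per page of a topic, given its number of replies."""
    total_posts = reply_count + 1  # replies plus the opening post
    for offset in range(0, total_posts, POSTS_PER_PAGE):
        yield BASE.format(topic_id=topic_id, offset=offset)
```

Each URL could then be passed to `lynx --source` or fetched by any other HTTP client.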

Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 19, 2024, 04:01:23 AM
 #25

I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 

LoyceV
Legendary
*
Offline Offline

Activity: 3388
Merit: 17126


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
July 19, 2024, 05:56:46 AM
 #26

For now, I am scraping topics from the forum using my bot.
If it helps, I can give you a tar.gz copy of my data
Sure, you can send me a copy by PM.
In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 19, 2024, 12:12:29 PM
 #27

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.



I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

It would also give me time to figure out the best setup for this.
LoyceV
Legendary
*
Offline Offline

Activity: 3388
Merit: 17126


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
July 19, 2024, 02:45:52 PM
Merited by NotATether (2)
 #28

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.
Is that a compressed tarball?
Yes.

Quote
Because for my daily backups I usually tar my folders without compression to make it go many times faster.
I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.

Quote
I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.
By all means: use my data for this Smiley Only a few posts are censored by me. Other than that, the file format is pretty much the same everywhere.
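
On the receiving side, a single gzip-compressed tarball (pigz output is gzip-compatible) can be read sequentially without first unpacking millions of files to disk. A minimal sketch, assuming nothing about the archive layout beyond it containing regular files:

```python
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_archive(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) for each regular file in a .tar.gz stream.

    Mode "r|gz" reads the archive strictly front to back, so each member
    must be consumed before moving on to the next one, as done here.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            if extracted is not None:
                yield member.name, extracted.read()
```

Each yielded pair can be handed straight to the indexer, which avoids creating one small file per post on the destination disk.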

seoincorporation
Legendary
*
Offline Offline

Activity: 3234
Merit: 3028



View Profile
July 20, 2024, 01:32:42 PM
 #29

I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process.  If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area. 

OK, then what happens if a user edits a post after it has been recorded in the search engine's database? It would be impossible to find the new information in edited posts, and I feel that will be a problem for this project.

We have seen a lot of threads containing just the word "Reserved", meant to be edited in the future. So here we have another challenge for the project: how to deal with edited posts?

TryNinja
Legendary
*
Offline Offline

Activity: 2912
Merit: 7350


Top Crypto Casino


View Profile WWW
July 20, 2024, 02:25:56 PM
 #30

We have seen a lot of threads containing just the word "Reserved", meant to be edited in the future. So here we have another challenge for the project: how to deal with edited posts?
There is no easy solution for that. It's impossible to keep track of 55 million posts.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 20, 2024, 05:03:19 PM
Last edit: July 20, 2024, 05:15:47 PM by NotATether
 #31

We have seen a lot of threads containing just the word "Reserved", meant to be edited in the future. So here we have another challenge for the project: how to deal with edited posts?

This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed as well.

Edit: I will implement a heuristic that tracks when a person last logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of the daily active users who might actually make an edit, and their user IDs should then be prioritized when scanning for edits and deletions.

The number of new posts daily >>> The number of edited posts daily

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
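
A minimal sketch of that prioritization, under stated assumptions: the `UserActivity` record and its fields are invented for illustration, "recent post count" stands in for the posting-frequency signal, and the 120-day cutoff is the guess quoted above rather than a confirmed forum threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class UserActivity:
    user_id: int
    last_active: datetime
    banned: bool
    posts_last_30_days: int  # hypothetical activity signal


def users_to_rescan(
    users: List[UserActivity], now: datetime, inactive_days: int = 120
) -> List[int]:
    """Return user IDs worth rescanning for edits, most active first."""
    cutoff = now - timedelta(days=inactive_days)
    # Banned and long-inactive accounts cannot edit, so drop them.
    candidates = [u for u in users if not u.banned and u.last_active >= cutoff]
    candidates.sort(key=lambda u: u.posts_last_30_days, reverse=True)
    return [u.user_id for u in candidates]
```

The returned order is only a scan priority; it does not tell the scraper which of a user's posts changed, so each selected user's posts still have to be re-fetched and compared.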
Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 20, 2024, 05:45:13 PM
 #32

so this leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized to scan for edits and deletions.

There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
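
The arithmetic behind that objection can be written out as a short sketch. It assumes one request per post, which is a simplification, since a listing page may show several posts at once.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def seconds_per_post(post_count: int) -> float:
    """Time budget per post if every post is re-fetched once a day."""
    return SECONDS_PER_DAY / post_count


def max_posts_per_day(seconds_between_requests: float) -> int:
    """How many posts one scraper can re-fetch in a day at a fixed delay."""
    return int(SECONDS_PER_DAY // seconds_between_requests)
```

For 22,000 posts, `seconds_per_post` gives a little under 4 seconds each, which matches the "about 4x" figure: a scraper limited to one request per second could fully rescan roughly four such users per day.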

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 20, 2024, 05:56:48 PM
 #33

There is no way for you to track which post I may edit.  You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited.  The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.

That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

* When the initial download is finished
Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 20, 2024, 06:11:28 PM
 #34

That's what I'm saying. I will have to manually go through all of a user's posts (but hey, at least I'll have their message and topic IDs this time!), but I know that statistically I will only need to search a few users at a time, because editing is infrequent compared to posting.

Manual processes will cause this project to fail - too much data.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
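
One possible shape for such a record-locking scheme, sketched with SQLite purely for illustration. The table and column names are invented, and parsers running on different machines would need a shared database server in place of a local file; the guarded `UPDATE` is the part that carries over.

```python
import sqlite3
import time
from typing import Optional


def init_queue(conn: sqlite3.Connection) -> None:
    """Create the work table: one row per topic, plus lock metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS topics ("
        "topic_id INTEGER PRIMARY KEY, locked_by TEXT, locked_at REAL)"
    )
    conn.commit()


def claim_next(
    conn: sqlite3.Connection, worker: str, lease_seconds: float = 600.0
) -> Optional[int]:
    """Claim one unlocked (or lease-expired) topic for this worker.

    Returns None if nothing is free or another worker won the race.
    """
    now = time.time()
    expired = now - lease_seconds
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT topic_id FROM topics "
            "WHERE locked_by IS NULL OR locked_at < ? "
            "ORDER BY topic_id LIMIT 1",
            (expired,),
        ).fetchone()
        if row is None:
            return None
        # The WHERE clause re-checks the lock, so two workers that picked
        # the same row cannot both succeed.
        updated = conn.execute(
            "UPDATE topics SET locked_by = ?, locked_at = ? "
            "WHERE topic_id = ? AND (locked_by IS NULL OR locked_at < ?)",
            (worker, now, row[0], expired),
        )
        return row[0] if updated.rowcount == 1 else None
```

The lease timeout means a parser that crashes mid-topic does not hold its record forever; another worker can pick it up once the lease expires.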

FatFork
Legendary
*
Offline Offline

Activity: 1680
Merit: 2632


Top Crypto Casino


View Profile WWW
July 20, 2024, 06:53:43 PM
 #35

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...

NeuroticFish
Legendary
*
Offline Offline

Activity: 3752
Merit: 6468


Looking for campaign manager? Contact icopress!


View Profile
July 20, 2024, 07:10:16 PM
 #36

Edit: I will implement a heuristic that tracks when a person last logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of the daily active users who might actually make an edit, and their user IDs should then be prioritized when scanning for edits and deletions.

Is there no option to ask Theymos to offer, in one way or another, the list of "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to expose them somehow.

Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 20, 2024, 07:42:12 PM
 #37

Is there no option to ask Theymos to offer, in one way or another, the list of "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to expose them somehow.

The only reason he couldn't do this would be if the indexing slowed down the site.  I can't imagine there is any native SMF code that searches on that key.  :/

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 28, 2024, 06:18:15 AM
 #38

My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I do, however, have LoyceV's archive (thanks, Loyce), but I am not sure whether it covers posts before 2018.
Vod
Legendary
*
Offline Offline

Activity: 3780
Merit: 3107


Licking my boob since 1970


View Profile WWW
July 28, 2024, 06:19:15 AM
 #39

My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I explained how to do this five posts ago.... Smiley    ^^^

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1680
Merit: 7118


In memory of o_e_l_e_o


View Profile WWW
July 28, 2024, 06:22:18 AM
 #40

I explained how to do this five posts ago.... Smiley    ^^^

Wow you're fast  Grin

You mean this?

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.