Bitcoin Forum

Bitcoin => Project Development => Topic started by: NotATether on July 16, 2024, 08:01:35 AM

Title: Bitcointalk Search Project
Post by: NotATether on July 16, 2024, 08:01:35 AM

I am trying to make a search engine for Bitcointalk posts, since Google and the built-in one are so bad.

List all the features you want in a search engine here.

For now, I am scraping topics from the forum using my bot. I made sure to identify the requests as coming from me in my program so that the admins know where this traffic is coming from.

It doesn't look like it's exceeding the threshold of one request per second so that's good.

Private boards are not being scraped. The scraping is being done as a guest.

Title: Re: Bitcointalk Search Project
Post by: ABCbits on July 16, 2024, 08:46:29 AM

Quote from: NotATether on July 16, 2024, 08:01:35 AM

List all the features you want in a search engine here.

How about the features already available on https://ninjastic.space/search? Aside from those, I would suggest these features.
1. Sort by relevancy.
2. Show a message when the search keyword may contain a typo (such as suggesting "bitcoin" when someone enters "bitcon").
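
One way the typo hint could work is fuzzy matching against a list of indexed terms. This is only a minimal sketch using Python's standard `difflib`; the function name `suggest_spelling`, the `cutoff` value and the sample vocabulary are made up for illustration, and a real search backend would probably use its own suggester instead.

```python
import difflib


def suggest_spelling(term, vocabulary, cutoff=0.8):
    """Return the closest known term, or None if the term is known or nothing is close.

    `vocabulary` is assumed to be a list of lowercase terms taken from the index.
    """
    term = term.lower()
    if term in vocabulary:
        return None  # spelled like a known term, nothing to suggest
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    vocabulary = ["bitcoin", "wallet", "taproot", "mining"]
    hint = suggest_spelling("bitcon", vocabulary)
    if hint:
        print(f'Did you mean "{hint}"?')
```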

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 16, 2024, 09:26:57 AM

Quote from: ABCbits on July 16, 2024, 08:46:29 AM

Quote from: NotATether on July 16, 2024, 08:01:35 AM

List all the features you want in a search engine here.

Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

It needs to show only an excerpt with a link and title, like the forum search and Google do. It needs page numbers for browsing the results by page, and most importantly it should not be looking inside quotes for keywords.
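
A rough sketch of the excerpt idea, assuming the post body is already plain text with quotes removed. The name `make_excerpt` and the `radius` parameter are hypothetical, and a search backend's own highlighting feature would normally replace this.

```python
def make_excerpt(text, keyword, radius=60):
    """Return a short window of `text` around the first hit for `keyword`."""
    index = text.lower().find(keyword.lower())
    if index == -1:
        # No hit in the body: fall back to the start of the post.
        cut = text[: 2 * radius]
        return cut + ("..." if len(text) > len(cut) else "")
    start = max(0, index - radius)
    end = min(len(text), index + len(keyword) + radius)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


if __name__ == "__main__":
    post = "Lots of text before the point. " * 5 + "Taproot activated. " + "More text. " * 10
    print(make_excerpt(post, "taproot", radius=30))
```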

Title: Re: Bitcointalk Search Project
Post by: LoyceV on July 16, 2024, 09:27:46 AM

Quote from: NotATether on July 16, 2024, 08:01:35 AM

For now, I am scraping topics from the forum using my bot.

If it helps, I can give you a tar.gz copy of my data (note: some posts are missing (https://bitcointalk.org/index.php?topic=5167469.msg63446454#msg63446454)). I shared it with Ninjastic years ago, and it would save you several months of scraping. Scraping fresh will get you more recent edits though, and fewer deleted posts.

Quote from: ABCbits on July 16, 2024, 08:46:29 AM

1. Sort by relevancy.

This would be the one thing I'd like to see, but also no doubt the most difficult one. Ninjastic often gives me a list of hundreds of posts. A good search engine (like Google 10 years ago) would show what I want to see first.

Title: Re: Bitcointalk Search Project
Post by: ABCbits on July 16, 2024, 10:19:21 AM

Quote from: NotATether on July 16, 2024, 09:26:57 AM

Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

Sorry for not being specific. I mean features such as the "Date Range (UTC)" filter, choosing one or more boards (optionally with their child boards) and support for search operators (+, -, | and "").

Title: Re: Bitcointalk Search Project
Post by: mocacinno on July 16, 2024, 10:49:26 AM

If you are scraping and parsing anyway, it would be nice if your search engine indexed the most common board objects... For example, the username, DT rank, feedback, boards,... That way, you could use keywords, like you can in Google (filetype:, site:,...).

It would be nice if I could make a query like `user:Theymos board:Bitcoin\Project_Development +wallet -knots taproot` and only see posts made by Theymos in the Project Development board that contained the word wallet, did not contain the word knots and hopefully contained the word taproot.
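
A query in that style could be split into filters and terms along these lines. This is a sketch only: the `ParsedQuery` structure, the `parse_query` name and the set of recognized filter keys are assumptions, not anything the search engine in this thread is known to use.

```python
from dataclasses import dataclass, field

# Filter keys recognized by this sketch (hypothetical; extend as needed).
FILTER_KEYS = {"user", "board"}


@dataclass
class ParsedQuery:
    filters: dict = field(default_factory=dict)   # key:value pairs such as user:Theymos
    required: list = field(default_factory=list)  # +term, must appear
    excluded: list = field(default_factory=list)  # -term, must not appear
    optional: list = field(default_factory=list)  # bare term, boosts relevance


def parse_query(text):
    parsed = ParsedQuery()
    for token in text.split():
        key, sep, value = token.partition(":")
        if token.startswith("+") and len(token) > 1:
            parsed.required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed.excluded.append(token[1:])
        elif sep and value and key.lower() in FILTER_KEYS:
            parsed.filters[key.lower()] = value
        else:
            parsed.optional.append(token)
    return parsed


if __name__ == "__main__":
    print(parse_query(r"user:Theymos board:Bitcoin\Project_Development +wallet -knots taproot"))
```

Keeping the filters separate from the free-text terms means they can later be mapped to exact-match conditions, while the terms go to full-text matching.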

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 16, 2024, 12:03:39 PM

Quote from: LoyceV on July 16, 2024, 09:27:46 AM

Quote from: NotATether on July 16, 2024, 08:01:35 AM

For now, I am scraping topics from the forum using my bot.

Sure, you can send me a copy by PM.

Quote from: mocacinno on July 16, 2024, 10:49:26 AM

I can't index DT information since it's invisible to guests and it changes too quickly anyway, but I'm already scraping the other stuff like boards, username (of course) etcetera. Even the user IDs are being scraped to help deal with name changes.

My bot can also handle anonymous users.

Title: Re: Bitcointalk Search Project
Post by: seoincorporation on July 16, 2024, 02:07:18 PM

Quote from: mocacinno on July 16, 2024, 10:49:26 AM

But wouldn't this be like rebuilding the full forum in a database?

I mean, there are 2 ways to do this:

1.- You take all the forum data and put it together in a database, and then your search engine makes calls to that database. But for this, you will have to live-update that database, or at least have a cron job to add the new data at a regular interval.

2.- Search for the data directly on the site, but for that, you would have to do some kind of hack to the current search engine.

If you have another way in mind, I would love to know how it works.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 16, 2024, 06:16:47 PM

Quote from: NotATether on July 16, 2024, 09:26:57 AM

it should not be looking inside quotes for keywords.

This is what I'm currently having issues with. People break the BB quote code all the time in their posts.

Title: Re: Bitcointalk Search Project
Post by: mocacinno on July 17, 2024, 05:34:14 AM

Quote from: seoincorporation on July 16, 2024, 02:07:18 PM

Quote from: mocacinno on July 16, 2024, 10:49:26 AM

It would be better if there was a way to improve SMF's search function or query the relational database directly, but I don't know if Theymos would give anybody direct access to the database or allow anybody to completely rework SMF's source code.
I don't think he would, and for good reason of course... It would require absolute trust in the person building the search engine.

But you're right, it would be completely rebuilding bitcointalk's database, like several other members are doing as well (more or less)...

Whenever I see somebody building extensions or offsite tools, or proposing changes to SMF, I can't help but wonder how epochtalk is doing, and if epochtalk would solve the problem without requiring browser plugins, offsite tools, scraping,... Don't get me wrong: the current forum software lacks several features, and I'm happy if somebody builds them (even if it's on a different domain, or requires me to install a browser plugin). I just wonder whether we'll ever switch to the new forum software.

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 17, 2024, 03:40:58 PM

This thing is going to use Elasticsearch, so if I can figure out how to handle the multi-lingual text, it may possibly support searching in different languages too.

Title: Re: Bitcointalk Search Project
Post by: TryNinja on July 17, 2024, 04:36:44 PM

Quote from: NotATether on July 16, 2024, 09:26:57 AM

Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

Soon™ ... ;)

https://www.talkimg.com/images/2024/07/17/4n6S1.png

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 17, 2024, 04:57:20 PM

Quote from: TryNinja on July 17, 2024, 04:36:44 PM

Quote from: NotATether on July 16, 2024, 09:26:57 AM

Ninjastic is showing entire posts so it's impossible to find anything meaningful when you search for a keyword.

Soon™ ... ;)

https://www.talkimg.com/images/2024/07/17/4n6S1.png

Personally I am against indexing quotes inside posts, since that runs into the risk of corpus duplication, which gives the quoted parts more weight in the search results. That means the posts that are quoted the most are also the most likely to be returned at the top (most likely in the form of some person's reply to them).

There's also the issue with nested quotes, which is hell to deal with using a database, and even Elasticsearch too. It can lead to infinitely recursive schema/JSON before you can even parse it completely.

Title: Re: Bitcointalk Search Project
Post by: LoyceV on July 17, 2024, 05:04:35 PM

Quote from: NotATether on July 17, 2024, 04:57:20 PM

Personally I am against indexing quotes inside posts, since that runs into the risk of corpus duplication

In most cases, I agree. But if the quote comes from an external website, the content can still be relevant to the search query.

Title: Re: Bitcointalk Search Project
Post by: TryNinja on July 17, 2024, 05:24:50 PM

Quote from: NotATether on July 17, 2024, 04:57:20 PM

I've separated quotes from post content.

That way there is:

- post content
- quotes

Quote from: NotATether on July 17, 2024, 04:57:20 PM

There's also the issue with nested quotes, which is hell to deal with using a database, and even Elasticsearch too. It can lead to infinitely recursive schema/JSON before you can even parse it completely.

Even if there are nested quotes, they are treated individually and also indexed on their own.

Take this post of mine for example. There is the content (everything that is NOT a quote of another user, like this text itself) and both quotes from author NotATether you can see above.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.
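
As an illustration of keeping post content and quotes apart, here is a sketch that works on BBCode-style `[quote author=...]` markup with a stack, so each nested quote becomes its own entry. It is not the code behind ninjastic.space; the input format, the `split_quotes` name and the handling of broken tags (stray closing tags are ignored, unclosed quotes run to the end of the post) are all assumptions.

```python
import re

# Matches an opening [quote ...] tag (group 1 = attributes) or a closing [/quote].
TAG = re.compile(r"\[quote\b([^\]]*)\]|\[/quote\]", re.IGNORECASE)
AUTHOR = re.compile(r"author=([^\s\]]+)")


def split_quotes(bbcode):
    """Return (content, quotes): text outside any quote, and a list of quote dicts."""
    content = []
    quotes = []
    stack = []  # open quotes, innermost last: (author, [text parts])
    pos = 0
    for match in TAG.finditer(bbcode):
        text = bbcode[pos:match.start()]
        (stack[-1][1] if stack else content).append(text)
        pos = match.end()
        if match.group(1) is None:  # closing tag
            if stack:
                author, parts = stack.pop()
                quotes.append({"author": author, "text": "".join(parts).strip()})
            # a stray [/quote] with nothing open is simply dropped
        else:  # opening tag
            found = AUTHOR.search(match.group(1))
            stack.append((found.group(1) if found else None, []))
    (stack[-1][1] if stack else content).append(bbcode[pos:])
    while stack:  # quotes that were never closed
        author, parts = stack.pop()
        quotes.append({"author": author, "text": "".join(parts).strip()})
    return "".join(content).strip(), quotes


if __name__ == "__main__":
    post = "[quote author=A][quote author=B]inner[/quote]outer[/quote]my reply"
    print(split_quotes(post))
```

Because every quote is stored flat with its author, there is no recursive structure to index: a search can target the content, the quotes, or quotes from one author.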

Title: Re: Bitcointalk Search Project
Post by: Vod on July 17, 2024, 07:28:12 PM

Quote from: NotATether on July 16, 2024, 12:03:39 PM

You CAN index DT information by running your parser under your account. See https://bitcointalk.org/captcha_code.php (it no longer works for my account but yours is probably fine)

Quote from: TryNinja on July 17, 2024, 05:24:50 PM

Even if there are nested quotes, they are treated individually and also indexed on their own.

You will be able to search only the content, only the quotes, quotes from X user, or both the content and quotes.

You must have some insane AIish parsing going. Many posts have broken quote html. :(

Title: Re: Bitcointalk Search Project
Post by: TryNinja on July 17, 2024, 07:32:01 PM

Quote from: Vod on July 17, 2024, 07:28:12 PM

You must have some insane AIish parsing going. Many posts have broken quote html. :(

I don't, in this case there isn't much to be done... If it's broken, it's broken.

But most posts don't have this problem, so it will be a lot better than what I currently have on ninjastic.space.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 17, 2024, 08:41:48 PM

Quote from: TryNinja on July 17, 2024, 07:32:01 PM

so it will be a lot better than what I currently have on ninjastic.space.

I watch your work with interest - I love large datasets and you seem to know the presentation layer well!

And then there is LoyceV - he has all the information in text format and is a whiz with queries, but he cannot present the info like you do. Loyce.club can be the fastest way to find exactly what one is looking for, but I work with your website if I only have partial info.

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 18, 2024, 11:23:55 AM

15,000 out of 1.7 million threads scraped so far, all topics being scraped in numerical order.

Quote from: Vod on July 17, 2024, 07:28:12 PM

Quote from: NotATether on July 16, 2024, 12:03:39 PM

You CAN index DT information by running your parser under your account. See https://bitcointalk.org/captcha_code.php (it no longer works for my account but yours is probably fine)

That's a security risk though, since it would require me to also store my password in plain text.

Even if I use one of the bots (BotATether or Jarvis), the added hassle of dealing with authentication will actually slow down post collection. Currently for each thread I'm launching a new browser - this helps me stay within the rate limits.

I might have to do a separate scrape for users specifically, to get everybody's DT information without duplicating stuff. But I don't think that's going to be happening anytime soon.

Title: Re: Bitcointalk Search Project
Post by: seoincorporation on July 18, 2024, 02:09:58 PM

Quote from: mocacinno on July 17, 2024, 05:34:14 AM

Quote from: seoincorporation on July 16, 2024, 02:07:18 PM

Quote from: mocacinno on July 16, 2024, 10:49:26 AM

If anyone has the access to implement a change like this in the forum, it is our verified hacker PowerGlove (https://bitcointalk.org/index.php?action=profile;u=3486361), but the fact that the search function doesn't work well must be for a reason. Maybe the forum used to suffer attacks through that vector.

This project would be easy if the RSS was still active on the forum, but sadly it has been removed:

https://bitcointalk.org/index.php?type=rss;action=.xml

Quote

action=.xml is disabled due to slowness. If you use this, write a post in Meta explaining your usage.

The problem is knowing when to index the new threads: without RSS we have to "ping" each board to know if there is new content.

Title: Re: Bitcointalk Search Project
Post by: TryNinja on July 18, 2024, 05:34:56 PM

Quote from: seoincorporation on July 18, 2024, 02:09:58 PM

The problem is knowing when to index the new threads: without RSS we have to "ping" each board to know if there is new content.

You don't have to do that.

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

For my scraper I did more of a workaround: checking the URL of the most recent posts to see if there might be zero replies on the thread, which would imply the post is an OP.

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 18, 2024, 05:51:20 PM

Quote from: TryNinja on July 18, 2024, 05:34:56 PM

If you keep track of every new topic, all you need to do is add +1 to the ID and see if it exists. Some problems might arise, like the thread being deleted before you check, meaning that the last ID has changed, but there are ways of minimizing this.

I could try checking the next sequence of 100 posts or so, in order to check for new posts, since it's extremely unlikely that they were all deleted.
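
The look-ahead idea can be sketched as below. The `exists` callable stands in for whatever request checks that an ID is live (it is hypothetical here, and in real use each call would count against the one-request-per-second limit); the window of 100 comes from the post above.

```python
def find_new_ids(last_known_id, exists, window=100):
    """Probe IDs after `last_known_id`, tolerating gaps left by deleted items.

    Scanning stops once `window` consecutive IDs are missing. Each hit
    restarts the window from that ID.
    """
    found = []
    candidate = last_known_id + 1
    limit = last_known_id + window
    while candidate <= limit:
        if exists(candidate):
            found.append(candidate)
            limit = candidate + window  # keep looking past the newest hit
        candidate += 1
    return found


if __name__ == "__main__":
    # Pretend these are the only live IDs on the forum.
    live = {1001, 1003, 1050}
    print(find_new_ids(1000, lambda i: i in live))
```

A larger window costs more wasted requests per poll, while a smaller one risks stalling behind a long run of deleted IDs.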

Title: Re: Bitcointalk Search Project
Post by: Vod on July 18, 2024, 07:59:36 PM

Quote from: NotATether on July 18, 2024, 11:23:55 AM

That's a security risk though, since it would require me to also store my password in plain text.

Never store passwords in plain text. Use a secrets manager, like the one AWS offers, to supply your password at runtime.
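
The general pattern is that the script receives the secret when it starts instead of reading it from its own source or config file. The sketch below uses an environment variable as a simple stand-in for a secrets manager (it is not the AWS approach itself); the variable name `BCT_BOT_PASSWORD` and the function name are invented for illustration.

```python
import os


def load_forum_password(env_var="BCT_BOT_PASSWORD", environ=os.environ):
    """Fetch the password injected at runtime; fail loudly if it is absent."""
    password = environ.get(env_var)
    if not password:
        raise RuntimeError(f"{env_var} is not set; no password is stored on disk as a fallback")
    return password


if __name__ == "__main__":
    try:
        load_forum_password()
        print("password loaded")
    except RuntimeError as error:
        print(error)
```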

Title: Re: Bitcointalk Search Project
Post by: seoincorporation on July 19, 2024, 03:46:24 AM

I'm curious about the method that you are using to get the data from each thread. There is a command on Linux called lynx, a web browser for the command line, and with it you can get the text of a page or its source code:

Code:

lynx --dump "https://bitcointalk.org/index.php?topic=5503125.0"

Code:

lynx --source "https://bitcointalk.org/index.php?topic=5503125.0"

You could use tools like cut and grep to keep only the relevant data. Making the script would be the easy part; getting the data from 5.5 million threads will be the hard part, lol. And the fact that each thread can have multiple pages makes it a challenge.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 19, 2024, 04:01:23 AM

Quote from: seoincorporation on July 19, 2024, 03:46:24 AM

I'm curious about the method that you are using to get the data from each thread

You understand how we get the info - just not how we know what info to get.

It depends on what data you want to process. If you just want to capture each thread, parse the front page every 10 seconds and grab the link from the "Newest Posts" area.

Title: Re: Bitcointalk Search Project
Post by: LoyceV on July 19, 2024, 05:56:46 AM

Quote from: NotATether on July 16, 2024, 12:03:39 PM

Quote from: LoyceV on July 16, 2024, 09:27:46 AM

Quote from: NotATether on July 16, 2024, 08:01:35 AM

For now, I am scraping topics from the forum using my bot.

If it helps, I can give you a tar.gz copy of my data

Sure, you can send me a copy by PM.

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 19, 2024, 12:12:29 PM

Quote from: LoyceV on July 19, 2024, 05:56:46 AM

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball? Because for my daily backups I usually tar my folders without compression to make it go many times faster.

I'm considering putting together a website and releasing the search, but with incomplete data for now. So far, I have over 23k posts (up to roughly July 2011), but I would not like to wait months before people can actually use this.

It would also give me time to figure out the best setup for this.

Title: Re: Bitcointalk Search Project
Post by: LoyceV on July 19, 2024, 02:45:52 PM

Quote from: NotATether on July 19, 2024, 12:12:29 PM

Quote from: LoyceV on July 19, 2024, 05:56:46 AM

In case you're (both) wondering: I'm still working on it. Creating a tar on a spinning disk with 50 million files was a bit of a mistake. It's fast to write but takes days to read.

Is that a compressed tarball?

Yes.

Quote

Because for my daily backups I usually tar my folders without compression to make it go many times faster.

I'm using pigz, which now only uses 1% of 1 CPU core. Reading from disk is the limitation. I thought I'd do you a favour by making one file instead of giving you half a million compressed files.

Quote

By all means: use my data for this :) Only a few posts are censored (https://bitcointalk.org/index.php?topic=5167469.0;all) by me. Other than that, the file format is pretty much the same everywhere.

Title: Re: Bitcointalk Search Project
Post by: seoincorporation on July 20, 2024, 01:32:42 PM

Quote from: Vod on July 19, 2024, 04:01:23 AM

Quote from: seoincorporation on July 19, 2024, 03:46:24 AM

I'm curious about the method that you are using to get the data from each thread

Ok, then what would happen if a user edited his post after it was recorded in the search engine database? That way it will be impossible to search for new information in edited posts, and I feel like that will be a problem for this project.

We have seen a lot of threads with the word "Reserved", to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

Title: Re: Bitcointalk Search Project
Post by: TryNinja on July 20, 2024, 02:25:56 PM

Quote from: seoincorporation on July 20, 2024, 01:32:42 PM

We have seen a lot of threads with the word "Reserved", to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

There is no easy solution for that. It's impossible to keep track of 55 million posts.

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 20, 2024, 05:03:19 PM

Quote from: seoincorporation on July 20, 2024, 01:32:42 PM

We have seen a lot of threads with the word "Reserved", to be edited in the future. So, here we have another challenge for the project: how to deal with edited posts?

This is going to be a live search engine, so every post is going to be kept up to date, and removed if the original post is removed as well.

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

The number of new posts daily >>> The number of edited posts daily

*I do not currently track the "last edited" time because it is an unreliable indicator for determining whether a given post might be edited in the future.
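
A simplified sketch of that prioritization, under stated assumptions: it only uses the last post date, a recent post count and a banned flag (login times are left out), and the `UserActivity` record, its field names and the ordering rule are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class UserActivity:
    user_id: int
    last_post: datetime
    recent_posts: int  # posts in some recent period, e.g. the last 30 days
    banned: bool = False


def edit_scan_queue(users, now, inactive_after=timedelta(days=120)):
    """Return user IDs to rescan for edits, most active first.

    Banned users and users with no post inside `inactive_after` are skipped,
    on the assumption that they are unlikely to edit anything.
    """
    candidates = [
        user for user in users
        if not user.banned and now - user.last_post <= inactive_after
    ]
    candidates.sort(key=lambda user: (-user.recent_posts, user.user_id))
    return [user.user_id for user in candidates]


if __name__ == "__main__":
    now = datetime(2024, 7, 20)
    users = [
        UserActivity(1, datetime(2024, 7, 19), 40),
        UserActivity(2, datetime(2024, 1, 1), 5),
        UserActivity(3, datetime(2024, 7, 18), 90, banned=True),
    ]
    print(edit_scan_queue(users, now))
```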

Title: Re: Bitcointalk Search Project
Post by: Vod on July 20, 2024, 05:45:13 PM

Quote from: NotATether on July 20, 2024, 05:03:19 PM

so this leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized to scan for edits and deletions.

There is no way for you to track which post I may edit. You would need to rescan all my 22k posts on a regular basis, or use some AI to determine what I may have edited. The number of seconds in a day is limited to about 4x my number of posts, and I'm sure there are more than three other people posting.
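
The arithmetic behind that objection can be checked with a few lines. This assumes the worst case of one request per post and the forum's limit of one request per second; the function name is made up, and the 55 million figure is the post count mentioned earlier in the thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def days_for_full_pass(post_count, requests_per_second=1.0, posts_per_request=1):
    """Days needed to re-fetch `post_count` posts once at the given request rate."""
    seconds = post_count / (requests_per_second * posts_per_request)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # One user with 22k posts: about a quarter of a day, i.e. roughly 4 passes per day.
    print(f"{days_for_full_pass(22_000):.2f} days")
    # The whole forum at 55 million posts: well over a year per pass.
    print(f"{days_for_full_pass(55_000_000):.0f} days")
```

Fetching full thread pages instead of single posts raises `posts_per_request` and shortens these numbers, but a blind rescan of everything still takes months.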

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 20, 2024, 05:56:48 PM

Quote from: Vod on July 20, 2024, 05:45:13 PM

That's what I'm saying. I will have to manually go through all of a user's posts (but hey, I'll have their message and topic IDs this time!), but at least I know that statistically I will only need to scan a few users at a time, because editing is infrequent compared to posting.

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

* When the initial download is finished

Title: Re: Bitcointalk Search Project
Post by: Vod on July 20, 2024, 06:11:28 PM

Quote from: NotATether on July 20, 2024, 05:56:48 PM

Manual processes will cause this project to fail - too much data.

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

Title: Re: Bitcointalk Search Project
Post by: FatFork on July 20, 2024, 06:53:43 PM

Quote from: NotATether on July 20, 2024, 05:56:48 PM

It's not perfect obviously - edited posts probably won't be indexed for several hours like that* - but it's the best I can think of.

I also think that it won't be feasible, at least not with a single scraper. As Vod said, his post history exceeds 21,000 entries. Continuously monitoring all his posts for edits would be very resource-intensive and/or time-consuming. And what about other members with even more activity, like philipma1957, BADecker, JayJuanGee, franky1, and others? We're talking hundreds of thousands of posts that would need parsing every day...

Title: Re: Bitcointalk Search Project
Post by: NeuroticFish on July 20, 2024, 07:10:16 PM

Quote from: NotATether on July 20, 2024, 05:03:19 PM

Edit: I will implement a heuristic that tracks the last time a person logged into their account and correlates that with the frequency of posts they made on particular days versus now. Banned users and inactive accounts that haven't posted for 120 days (or whatever the "This account recently woke up from a long period of inactivity" threshold is) can be excluded. This leaves a small subset of users who might actually make an edit, out of the number of daily active users, whose user IDs should then be prioritized when scanning for edits and deletions.

Is there no option to ask Theymos to offer, in one way or another, a list of the "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to offer them somehow.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 20, 2024, 07:42:12 PM

Quote from: NeuroticFish on July 20, 2024, 07:10:16 PM

Is there no option to ask Theymos to offer, in one way or another, a list of the "last" edited posts (maybe after a given date)?
I mean, the forum knows them; it only has to offer them somehow.

The only reason he couldn't do this would be if the indexing would slow down the site. I wouldn't imagine any native SMF code that searches based on that key. :/

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 28, 2024, 06:18:15 AM

My scraper was broken by Cloudflare after about 58K posts or so. I can use proxies to get around it (while keeping the same rate limit), but I will need to figure out how to integrate that into the code.

I do however have LoyceV's archive (thanks Loyce), but I am not sure whether it covers posts before 2018.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 28, 2024, 06:19:15 AM

Quote from: NotATether on July 28, 2024, 06:18:15 AM

I explained how to do this five posts ago.... :) ^^^

Title: Re: Bitcointalk Search Project
Post by: NotATether on July 28, 2024, 06:22:18 AM

Quote from: Vod on July 28, 2024, 06:19:15 AM

I explained how to do this five posts ago.... :) ^^^

Wow you're fast ;D

You mean this?

Quote from: Vod on July 20, 2024, 06:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

Title: Re: Bitcointalk Search Project
Post by: Vod on July 28, 2024, 06:28:13 AM

Quote from: NotATether on July 28, 2024, 06:22:18 AM

I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser. You are fighting against every fraud detection tool out there.

Create a schedule table in your database. Columns include jobid, lockid, lastjob and parsedelay. When your parser grabs a job, it locks it in the table so the next parser will grab a different job. It releases the lock when it finishes. Your parser can call the first record in the schedule based on (lastjob+parsedelay) where lockid is free.

Edit: Then go to one of the cloud providers and use a free service to create a second parser.
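
The schedule table described above could be sketched like this with SQLite. The column names come from the post; the column types, the function names and the exact SQL are assumptions. Parsers running on different machines would need a shared database server rather than a local SQLite file, but the claim and release logic would look much the same.

```python
import sqlite3
import time


def create_schedule(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS schedule (
               jobid      INTEGER PRIMARY KEY,
               lockid     TEXT,                      -- NULL means the job is free
               lastjob    REAL NOT NULL DEFAULT 0,   -- time the job last finished
               parsedelay REAL NOT NULL DEFAULT 10   -- seconds to wait between runs
           )"""
    )


def claim_job(conn, worker_id, now=None):
    """Lock and return the most overdue free job, or None if nothing is due."""
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT jobid FROM schedule "
            "WHERE lockid IS NULL AND lastjob + parsedelay <= ? "
            "ORDER BY lastjob + parsedelay, jobid LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The extra lockid check makes the claim fail if another parser got there first.
        cursor = conn.execute(
            "UPDATE schedule SET lockid = ? WHERE jobid = ? AND lockid IS NULL",
            (worker_id, row[0]),
        )
        return row[0] if cursor.rowcount == 1 else None


def release_job(conn, jobid, worker_id, now=None):
    """Free the lock and record when the job finished."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "UPDATE schedule SET lockid = NULL, lastjob = ? WHERE jobid = ? AND lockid = ?",
            (now, jobid, worker_id),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schedule(conn)
    conn.execute("INSERT INTO schedule (jobid, parsedelay) VALUES (1, 10)")
    job = claim_job(conn, "parser-a")
    print("claimed job", job)
    release_job(conn, job, "parser-a")
```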

Title: Re: Bitcointalk Search Project
Post by: LoyceV on August 06, 2024, 03:14:09 PM

Quote from: NotATether on July 28, 2024, 06:18:15 AM

My scraper was broken by Cloudflare after about 58K posts or so.

If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems when Cloudflare goes into full DDoS protection mode.

Quote

I do however have LoyceV's archive (thanks Loyce), but I am not sure whether it covers posts before 2018.

It's in the "oldposts" directory :)

Quote from: Vod on July 20, 2024, 06:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

Quote from: theymos on February 12, 2015, 10:03:56 PM

The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
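
A token bucket is one common way to hold a scraper to "once per second on average" while still allowing short bursts. The sketch below is an illustration, not anything the forum prescribes: the class name, the default burst size of 5 and the injectable `clock` are assumptions, since the rule does not say how large a burst is acceptable.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is possible
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return whether the request may go out."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def acquire(self):
        """Block until a request is allowed."""
        while not self.try_acquire():
            time.sleep((1.0 - self.tokens) / self.rate)


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, burst=2)
    for number in range(4):
        bucket.acquire()
        print("request", number, "allowed")
```

If several scrapers run for the same person, they would have to share one bucket, since the limit applies per person rather than per server.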

Title: Re: Bitcointalk Search Project
Post by: Vod on August 06, 2024, 10:52:46 PM

Quote from: LoyceV on August 06, 2024, 03:14:09 PM

Quote from: Vod on July 20, 2024, 06:11:28 PM

Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?

The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup - if one goes down for whatever reason a second one can take over. 90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP. I parse once every ten seconds to check for any new posts, and if any I parse them. My record locking system has a parse delay for many things to prevent it from hitting bct too often. I don't even parse as a logged in user.