Bitcoin Forum
Topic: Bitcointalk Search Project (Read 627 times)
Vod
Legendary
*
Offline Offline

Activity: 3850
Merit: 3151


Licking my boob since 1970


View Profile WWW
July 28, 2024, 06:28:13 AM
 #41

Quote
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.

You are still thinking of ONE parser going out pretending to be another parser. You are fighting against every fraud detection tool out there.

Create a schedule table in your database. Columns include jobid, lockid, lastjob and parsedelay. When a parser grabs a job, it locks it in the table so the next parser will grab a different job, and it releases the lock when it finishes. Each parser picks the first record in the schedule, ordered by (lastjob + parsedelay), where lockid is free.

Edit: Then go to one of the cloud providers and use a free service to create a second parser.
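
For anyone who wants to try the schedule table idea, here is a minimal sketch in Python. The column names (jobid, lockid, lastjob, parsedelay) come from the post above; everything else is an assumption made for illustration: SQLite as the database, the function names `make_schedule`, `claim_job` and `release_job`, and storing timestamps as Unix seconds. A real multi-server setup would need a shared database rather than a local SQLite file.

```python
import sqlite3
import time


def make_schedule(conn):
    """Create the schedule table (hypothetical layout based on the post)."""
    conn.execute(
        """CREATE TABLE schedule (
               jobid      INTEGER PRIMARY KEY,
               lockid     TEXT,            -- NULL means the job is free
               lastjob    REAL NOT NULL,   -- Unix time the job last finished
               parsedelay REAL NOT NULL    -- minimum seconds between runs
           )"""
    )


def claim_job(conn, parser_id, now=None):
    """Lock the free job that has been due the longest; return its jobid or None."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            """SELECT jobid FROM schedule
               WHERE lockid IS NULL AND lastjob + parsedelay <= ?
               ORDER BY lastjob + parsedelay
               LIMIT 1""",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The "lockid IS NULL" guard means the update only succeeds if no
        # other parser locked the row after we selected it.
        cur = conn.execute(
            "UPDATE schedule SET lockid = ? WHERE jobid = ? AND lockid IS NULL",
            (parser_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


def release_job(conn, parser_id, jobid, now=None):
    """Release the lock held by parser_id and record the finish time."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "UPDATE schedule SET lockid = NULL, lastjob = ? "
            "WHERE jobid = ? AND lockid = ?",
            (now, jobid, parser_id),
        )
```

Each parser would call `claim_job` in a loop, do the work for the returned job, then call `release_job`. Because lastjob is only updated on release, parsedelay still spaces out requests to the same job no matter how many parsers are running.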

I post for interest - not signature spam.
https://vod.fan - fast/free image sharing - coming Oct!
Will Theymos finish his $100,000,000 forum before this one shuts down?
LoyceV
Legendary
*
Offline Offline

Activity: 3458
Merit: 17514


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
August 06, 2024, 03:14:09 PM
 #42

Quote
My scraper was broken by Cloudflare after about 58K posts or so.
If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems whenever Cloudflare went into full DDoS protection mode.

Quote
I do however have LoyceV's archive (thanks Loyce) But I am not sure whether it covers posts before 2018.
It's in the "oldposts" directory :)

Quote
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
Quote
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
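
One common way to get "once per second on average, with small bursts" is a token bucket. The sketch below is only an illustration: the class name `TokenBucket`, the `reserve` method and the burst size of 3 are assumptions, since the quoted rule does not put a number on what counts as an acceptable burst. Because the limit is per person, a single bucket would have to be shared by every scraper you run, not one bucket per server.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate=1.0, burst=3, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = float(burst)  # assumed burst size, not from the forum rules
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at the capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def reserve(self):
        """Take one token and return how many seconds to sleep before the request."""
        self._refill()
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a debt that must be waited off at `rate`.
        return -self.tokens / self.rate
```

A scraper would call `time.sleep(bucket.reserve())` before every request. The first few requests go out immediately, and after that the loop settles at one request per second.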

Vod
August 06, 2024, 10:52:46 PM
 #43

Quote
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).

I use multiple parsers for backup: if one goes down for whatever reason, a second one can take over. 90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP. I poll once every ten seconds to check for new posts and, if there are any, I parse them. My record locking system has a parse delay for many things to prevent it from hitting bct too often. I don't even parse as a logged-in user.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 1750
Merit: 7326


In memory of o_e_l_e_o


View Profile WWW
September 19, 2024, 07:26:09 AM
 #44

Bump. I just got a new server for the scraper along with a bunch of HTTP proxies.

Quote
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs?
The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second).
Quote
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)

This bold part should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once it catches up with the database downloads and only has to check for new posts and edits to old posts. I wish there were a mod log that recorded post edits.

BlackHatCoiner
Legendary
*
Offline Offline

Activity: 1666
Merit: 8227


Bitcoin is a royal fork


View Profile WWW
September 21, 2024, 08:09:28 PM
 #45

Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
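-- Create a MySQL account that can only connect from the local machine.
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
-- Let it read, but not modify, every table in the forum database.
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
-- Not strictly needed after GRANT, but harmless.
FLUSH PRIVILEGES;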

:D
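
To show what "SELECT only" looks like from the client side, here is a small Python sketch. It swaps MySQL for SQLite so that it runs without a server, so the read-only mechanism is different (a `mode=ro` connection instead of a GRANT). The `posts` table, its columns, the sample rows and the function names are all made up for the demo.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db():
    """Create a throwaway database with a hypothetical `posts` table."""
    path = Path(tempfile.mkdtemp()) / "posts.db"
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(1, "alice", "first post"), (2, "bob", "second post")],
    )
    conn.commit()
    conn.close()
    return path


def open_read_only(path):
    """Open the database so that any write attempt is rejected by SQLite."""
    return sqlite3.connect(Path(path).as_uri() + "?mode=ro", uri=True)


def run_select(conn, sql, params=()):
    """Run a query and return all rows; values are passed as bound parameters."""
    return conn.execute(sql, params).fetchall()
```

On a connection opened this way, `run_select(conn, "SELECT body FROM posts WHERE author = ?", ("alice",))` works, while an INSERT, UPDATE or DELETE raises `sqlite3.OperationalError`. A read-only account protects the data from being changed, but it does nothing to stop an expensive query from hogging the server.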
NotATether (OP)
September 28, 2024, 11:16:32 AM
 #46

Scraping has resumed

(I am respecting Bitcointalk.org rate limits)

Vod
September 30, 2024, 12:39:43 AM
 #47

Quote
Just allow us to run any SELECT queries we want and you will have covered every potential feature!

No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that...

Code:
CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass';
GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost';
FLUSH PRIVILEGES;

:D

'Tis a good idea - but beware SQL injection!

NotATether (OP)
October 01, 2024, 03:15:46 PM
 #48

In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this:

Quote
'Tis a good idea - but beware SQL injection!

But also because it would become a huge performance problem very quickly. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour.

At any rate, the data was always going to be placed in an Elasticsearch cluster, not an SQL database. Searching inside paragraphs using SELECT is so difficult that it might not even be practical.
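
The reason a search engine copes with this better is the inverted index: a map from each word to the posts that contain it, so a query never has to scan every paragraph. The Python sketch below shows only that core idea. It is not how Elasticsearch is implemented internally, and the tokenizer and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(posts):
    """Map every word to the set of post IDs whose body contains it."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        for term in tokenize(body):
            index[term].add(post_id)
    return index


def search(index, query):
    """Return the IDs of posts that contain every word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a word is a single dictionary access regardless of how many posts are stored, which is what a `LIKE '%word%'` scan over every row cannot offer.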

NotATether (OP)
October 06, 2024, 04:31:04 AM
 #49

Damn, I just realized that threads like this, with only images in them, are unscrapeable.

As in, the topic is indexed, but since there is no text, the message contents are all empty.

I can't figure out how one would effectively make a search by an image.

On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.

LoyceV
October 07, 2024, 10:06:38 AM
 #50

Quote
I can't figure out how one would effectively make a search by an image.
I'd say just search the text:
Code:
[img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img]
So a search for "Anita calculators" should pop up this post (Ninjastic can't find it), but also "Anita841_5" or better "Anita841". Anyone searching for the word "images" is on his own, but finding the right posts when you search for "assets" is probably going to be a challenge.
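
A rough Python sketch of that idea: pull the URL out of each [img] tag and break it into words that can be indexed alongside the post text. The function name `image_terms` and the splitting rules are assumptions. Filtering out noise words such as "www", "assets" or "images" is left out on purpose, since deciding which terms are noise is the hard part mentioned above.

```python
import re
from urllib.parse import urlparse

IMG_TAG = re.compile(r"\[img[^\]]*\](.*?)\[/img\]", re.IGNORECASE | re.DOTALL)


def image_terms(bbcode):
    """Return lowercase search terms taken from the URLs inside [img] tags."""
    terms = []
    for url in IMG_TAG.findall(bbcode):
        parsed = urlparse(url.strip())
        # Drop the file extension so "jpg" does not become a search term.
        path_no_ext = re.sub(r"\.[A-Za-z0-9]+$", "", parsed.path)
        # Keep the whole file name (e.g. "Anita841_5") as one term ...
        stem = path_no_ext.rsplit("/", 1)[-1]
        # ... and also every alphanumeric piece of the host and path.
        pieces = re.split(r"[^A-Za-z0-9]+", parsed.netloc + "/" + path_no_ext)
        for term in [stem] + pieces:
            term = term.lower()
            if term and term not in terms:
                terms.append(term)
    return terms
```

For the image tag quoted above, the returned terms include "anita", "calculators", "anita841_5" and "anita841", so a search for any of them could match the post.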
