Vod
Legendary
Offline
Activity: 4298
Merit: 3377
Licking my boob since 1970
|
 |
July 28, 2024, 06:28:13 AM |
|
I haven't really done such a thing before. But like I said, I have a few IP addresses, so I guess I'll see how that goes.
You are still thinking of ONE parser going out pretending to be another parser. You are fighting against every fraud detection tool out there. Create a schedule table in your database. Columns include jobid, lockid, lastjob and parsedelay. When your parser grabs a job, it locks it in the table so the next parser will grab a different job. It releases the lock when it finishes. Your parser can call the first record in the schedule based on (lastjob+parsedelay) where lockid is free. Edit: Then go to one of the cloud providers and use a free service to create a second parser.
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
August 06, 2024, 03:14:09 PM |
|
My scraper was broken by Cloudflare after about 58K posts or so. If you ask nicely, maybe theymos can whitelist your server IP in Cloudflare. That solved my download problems when Cloudflare goes in full DDoS protection mode. I do however have LoyceV's archive (thanks Loyce) But I am not sure whether it covers posts before 2018. It's in the "oldposts" directory  Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs? The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second). The rules are the same as for humans. But keep in mind: - No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
Vod
Legendary
Offline
Activity: 4298
Merit: 3377
Licking my boob since 1970
|
 |
August 06, 2024, 10:52:46 PM |
|
Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs? The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second). I use multiple parsers for backup - if one goes down for whatever reason a second one can take over. 90% of the time, my parsers have nothing to do, since I'm not parsing every profile like I did with BPIP. I parse once every ten seconds to check for any new posts, and if any I parse them. My record locking system has a parse delay for many things to prevent it from hitting bct too often. I don't even parse as a logged in user.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
September 19, 2024, 07:26:09 AM |
|
Bump. I just got a new server for the scraper along with a bunch of HTTP proxies. Why don't you implement a record locking system into your parser, so you can have multiple parsers running at once from various IPs? The rate limit is supposed to be per person, not per server. You shouldn't use multiple scrapers to get around the limit (1 connection per second). The rules are the same as for humans. But keep in mind: - No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.) This bold part should solve my problem with endless Cloudflare captchas. Besides, the parser is bound to take breaks once I catch up with the database downloads and it is limited to checking for new posts and edits to old posts. I wish there was a mod log that contained edited post events.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
BlackHatCoiner
Legendary
Offline
Activity: 1890
Merit: 9199
Bitcoin is ontological repair
|
 |
September 21, 2024, 08:09:28 PM |
|
Just allow us to run any SELECT queries we want and you will have covered every potential feature! No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that... CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass'; GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost'; FLUSH PRIVILEGES;

|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
September 28, 2024, 11:16:32 AM |
|
Scraping has resumed
(I am respecting Bitcointalk.org rate limits)
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
Vod
Legendary
Offline
Activity: 4298
Merit: 3377
Licking my boob since 1970
|
 |
September 30, 2024, 12:39:43 AM |
|
Just allow us to run any SELECT queries we want and you will have covered every potential feature! No seriously. Just load the database with all the posts, and allow the user to execute any SELECT query they want and you will have covered everything we might ask. Maybe add a nice UI for basic search requests, like author and subject search, but beyond that... CREATE USER 'readonly'@'localhost' IDENTIFIED BY 'pass'; GRANT SELECT ON bitcointalk_database.* TO 'readonly'@'localhost'; FLUSH PRIVILEGES;
 'Tis a good idea - but beware SQL injection!
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 01, 2024, 03:15:46 PM |
|
In all seriousness though, I'm not in favor of exposing SQL directly, not only because of this: 'Tis a good idea - but beware SQL injection!
But also because it would be a huge performance issue very fast. Captchas will not protect the database much, as there are bot-solvers you can rent for sats per hour. At any rate, the data was always going to be placed in an Elasticsearch cluster, not an SQL database. Searching inside paragraphs using SELECT is so difficult that it might not even be possible.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 06, 2024, 04:31:04 AM |
|
Damn, I just realized that threads like this with only images on them are unscrapeable. As in, the topic is indexed, but since there is no text, the message contents are all empty. I can't figure out how one would effectively make a search by an image. On a side-note, I am approaching Wall Observer-sized threads. Let's see if my scraper can grok them. It can swallow threads with a thousand or so pages, but I've never tested with tens of thousands.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
October 07, 2024, 10:06:38 AM |
|
I can't figure out how one would effectively make a search by an image. I'd say just search the text: [img]http://www.anita-calculators.info/assets/images/Anita841_5.jpg[/img] So a search for "Anita calculators" should pop up this post ( Ninjastic can't find it), but also "Anita841_5" or better "Anita841". Anyone searching for the word "images" is on his own, but finding the right posts when you search for "assets" is probably going to be a challenge.
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 17, 2024, 06:46:08 AM Last edit: October 17, 2024, 07:26:47 AM by NotATether |
|
I crossed 100K posts threads scraped a few days ago.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
BlackHatCoiner
Legendary
Offline
Activity: 1890
Merit: 9199
Bitcoin is ontological repair
|
 |
October 17, 2024, 07:02:07 AM |
|
I crossed 100K posts scraped a few days ago.
Wouldn't it take years until you reach the 60 millionth posts?
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
October 17, 2024, 07:18:05 AM |
|
I crossed 100K posts scraped a few days ago. If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts?
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 17, 2024, 07:28:01 AM |
|
I crossed 100K posts scraped a few days ago. If you're doing this at 1 request per second, and started 3 months ago, that's almost 8 million requests. There are up to 20 posts per page. Did you mean 100M posts? My bad guys, I meant to write 100K threads.  Bad typo. But at the rate this is going, it's going to be scraping like 70k posts a month. So I better start processing that archive you gave me.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
October 17, 2024, 07:46:49 AM |
|
My bad guys, I meant to write 100K threads.  Bad typo. But at the rate this is going, it's going to be scraping like 70k posts a month. Those numbers still don't add up. Less than one post per thread?
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 17, 2024, 08:11:29 AM |
|
Those numbers still don't add up. Less than one post per thread?
Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean  So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page. Like this, I scan scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster. About 1/3 or so of the topics I scraped don't exist, are deleted, quarantined, or nuked for whatever reason so the parser runs very fast in those cases. But for the ones that do exist, usually there are 1-2 pages only per topic, which is what the majority of running time the scraper is spent on. Occasionally I come across 10+ or 50+ page topics. But just having a lot of pages is not going to pull the average number of scraped topics/day down unless there's hundreds of pages in the topic. Due to rate limiting constraints, it can only navigate to about 2.5k pages per day. Including the 'Not Found' pages as well. Here is a typical scrape in progress (this is only the log file. I actually built a whole user interface around this): 
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
October 17, 2024, 08:28:12 AM |
|
Due to rate limiting constraints, it can only navigate to about 2.5k It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done? At 70k threads per month, it's going to take you 6.5 years to scrape all 5.5 million threads. I remember it took me about 4 months scraping all the old posts, and that was only up to 2019. If you go from 10 to 1 second per request, it still takes 6 months.
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 17, 2024, 08:44:56 AM |
|
It looks like you use a 10 second delay between requests, is that correct? Why not reduce it to 1 second until it's done?
See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk. Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
LoyceV
Legendary
Offline
Activity: 3906
Merit: 20747
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
October 17, 2024, 08:56:12 AM |
|
See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk. When I click all, I have to click "Verify" from Cloudflare. That indeed means they're "tougher" than usual. Cloudflare would automatically block my IP address with a 1015 code if rate limits were exceeded (If I ever do get rate limited, there is an exponential backoff in place), however in this case, it looks like I have tripped an anti-DDoS measure placed on the forum's web server. I just tried this oneliner: i=5503125; time while test $i -le 5503134; do wget "https://bitcointalk.org/index.php?topic=$i.0" -O $i.html; sleep 1; i=$((i+1)); done It took 13 seconds on my whitelisted server, and 12 seconds on a server that isn't whitelisted. Both include 9 seconds "sleep".
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2184
Merit: 9179
Trêvoid █ No KYC-AML Crypto Swaps
|
 |
October 17, 2024, 09:12:13 AM |
|
And another thing I've just realized - I haven't reached any mixer threads yet, but now that they are all replaced with "[banned mixer]", nobody is going to find a lot of such topics when searching by name (or URL).
So maybe I might introduce some sort of tagging system and just manually add tags to some posts (yes, posts not threads) so users can at least find what they are looking for properly.
|
|
|
|
|
|
| . betpanda.io | │ |
ANONYMOUS & INSTANT .......ONLINE CASINO....... | │ | ▄███████████████████████▄ █████████████████████████ █████████████████████████ ████████▀▀▀▀▀▀███████████ ████▀▀▀█░▀▀░░░░░░▄███████ ████░▄▄█▄▄▀█▄░░░█▄░▄█████ ████▀██▀░▄█▀░░░█▀░░██████ ██████░░▄▀░░░░▐░░░▐█▄████ ██████▄▄█░▀▀░░░█▄▄▄██████ █████████████████████████ █████████████████████████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀░░░▀██████████ █████████░░░░░░░█████████ ████████░░░░░░░░░████████ ████████░░░░░░░░░████████ █████████▄░░░░░▄█████████ ███████▀▀▀█▄▄▄█▀▀▀███████ ██████░░░░▄░▄░▄░░░░██████ ██████░░░░█▀█▀█░░░░██████ ██████░░░░░░░░░░░░░██████ █████████████████████████ ▀███████████████████████▀ | ▄███████████████████████▄ █████████████████████████ ██████████▀▀▀▀▀▀█████████ ███████▀▀░░░░░░░░░███████ ██████▀░░░░░░░░░░░░▀█████ ██████░░░░░░░░░░░░░░▀████ ██████▄░░░░░░▄▄░░░░░░████ ████▀▀▀▀▀░░░█░░█░░░░░████ ████░▀░▀░░░░░▀▀░░░░░█████ ████░▀░▀▄░░░░░░▄▄▄▄██████ █████░▀░█████████████████ █████████████████████████ ▀███████████████████████▀ | .
SLOT GAMES ....SPORTS.... LIVE CASINO | │ | ▄░░▄█▄░░▄ ▀█▀░▄▀▄░▀█▀ ▄▄▄▄▄▄▄▄▄▄▄ █████████████ █░░░░░░░░░░░█ █████████████ ▄▀▄██▀▄▄▄▄▄███▄▀▄ ▄▀▄██▄███▄█▄██▄▀▄ ▄▀▄█▐▐▌███▐▐▌█▄▀▄ ▄▀▄██▀█████▀██▄▀▄ ▄▀▄█████▀▄████▄▀▄ ▀▄▀▄▀█████▀▄▀▄▀ ▀▀▀▄█▀█▄▀▄▀▀ | Regional Sponsor of the Argentina National Team |
|
|
|
|