Topic: Scraping forum content (Read 381 times)
anaconda46 (OP) | Newbie | Activity: 5 | Merit: 0
December 10, 2019, 09:01:53 AM | #1

I would like to confirm the policy surrounding scraping content via automated means. Is this something that is permissible? The stipulation is that no more than one (1) page will be accessed every one (1) second.
CodyAlfaridzi | Hero Member | Activity: 1708 | Merit: 541
December 10, 2019, 09:10:44 AM | #2 | Merited by LoyceV (1)

Quote from: anaconda46
I would like to confirm the policy surrounding scraping content via automated means. Is this something that is permissible? The stipulation is that no more than one (1) page will be accessed every one (1) second.

Yes.
The rules are the same as for humans. But keep in mind:
- No one is allowed to access the site more often than once per second on average. (Somewhat higher burst accesses are OK.)
- Every post must be on-topic. Any bot response to a topic is almost certainly off-topic. Changetip's behavior of responding to user commands publicly would not be allowed, for example.
- If someone complains about an unsolicited PM you send them, then you're probably going to be banned.
Quote
Those IPs are not blocked currently. But your other abusive IPs were blocked. Just your quotefast requests (which are only part of what your crawler does) were occurring at an average frequency of 7.6 requests per second in the most recent access logging period. Your requests constituted 3.4% of all forum requests in this period. This is entirely unacceptable and of course resulted in those IPs being banned.

Quote
The maximum allowed bot request frequency is 1 request per second. Those IPs are now accessing pages at an average of 2.5 requests per second combined. If you continue exceeding the allowed request limit, we will continue banning your IPs.
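For illustration only (not part of the forum rules themselves), a minimal Python sketch of staying at or below one request per second on average; the requests library, the user-agent string, and the example URL are assumptions, not anything specified in this thread:

import time
import requests  # assumed HTTP library, not mandated by the forum

MIN_INTERVAL = 1.0  # the stated limit: no more than 1 request per second on average

def fetch_pages(urls):
    """Fetch each URL, sleeping as needed so requests never exceed the allowed rate."""
    last_request = 0.0
    for url in urls:
        wait = MIN_INTERVAL - (time.monotonic() - last_request)
        if wait > 0:
            time.sleep(wait)  # throttle before the next request
        last_request = time.monotonic()
        yield url, requests.get(url, headers={"User-Agent": "example-research-bot"})

# Hypothetical usage:
# for url, resp in fetch_pages(["https://bitcointalk.org/index.php?topic=5202231.0"]):
#     print(url, resp.status_code)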
Tytanowy Janusz | Legendary | Activity: 2156 | Merit: 1622
December 10, 2019, 05:56:01 PM | #3

LoyceV is scraping the forum all the time. You could try to use his data for your purpose.
A few of his threads:

https://bitcointalk.org/index.php?topic=5202231.0
https://bitcointalk.org/index.php?topic=5153695.0
https://bitcointalk.org/index.php?topic=5134444.0

And this is his page with a bunch of scraped data (profiles, posts):
http://loyce.club/archive/
Initscri | Hero Member | Activity: 1540 | Merit: 759
December 10, 2019, 06:06:15 PM | #4

Quote from: Tytanowy Janusz
LoyceV is scraping the forum all the time. You could try to use his data for your purpose. <…>

Also, just to add to this: if LoyceV's data works for you, and with his permission you're able to use it, that would be the recommended/best route.

The last thing the forum needs IMO is 20+ people scraping the forum for the same data but for different purposes.

LoyceV | Legendary | Activity: 3304 | Merit: 16609 | Thick-Skinned Gang Leader and Golden Feather 2021
December 10, 2019, 07:01:17 PM | #5

Quote from: Initscri
Also, just to add to this: if LoyceV's data works for you, and with his permission you're able to use it, that would be the recommended/best route.
I publish data for a reason: use it any way you can :)
@anaconda46: can you share what you're looking for? For the right purpose, I could make my data accessible in a compressed file format so you don't have to download a couple million files.

Quote from: Initscri
The last thing the forum needs IMO is 20+ people scraping the forum for the same data but for different purposes.
Correct. My fear is that it would eventually lead to scraping being banned unless you get whitelisted.

anaconda46 (OP) | Newbie | Activity: 5 | Merit: 0
December 11, 2019, 05:59:45 AM | #6


Quote from: LoyceV
@anaconda46: can you share what you're looking for? For the right purpose, I could make my data accessible in a compressed file format so you don't have to download a couple million files.
I am doing some research. I am not actually downloading any files. I am saving information to a PostgreSQL database.

As long as there are no "are you a human" tests, scraping traffic should be identical to normal traffic with the library I am using.
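As a rough illustration of the setup described above, this is what writing scraped posts into PostgreSQL could look like; the psycopg2 driver, the connection settings, and the posts table are all assumptions, not details given in the thread:

import psycopg2  # assumed PostgreSQL driver

# Hypothetical connection and table layout, for illustration only.
conn = psycopg2.connect(dbname="forum_research", user="scraper",
                        password="secret", host="localhost")

def save_post(topic_id, post_id, author, body):
    """Insert one scraped post; ON CONFLICT keeps re-scrapes from creating duplicates."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO posts (topic_id, post_id, author, body)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (post_id) DO NOTHING
            """,
            (topic_id, post_id, author, body),
        )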
DdmrDdmr | Legendary | Activity: 2310 | Merit: 10758 | There are lies, damned lies and statistics. MTwain
December 11, 2019, 08:30:39 AM | #7

Quote from: anaconda46
<…> As long as there are no "are you a human" tests <…>
There aren’t (yet). Cloudflare seems to handle things pretty well, giving you an error if you try to load pages at a rate faster than 1 page per second. Other than that, I have never encountered any issues to date (although I scrape a limited amount over a span of about 12 consecutive hours, and every now and then go for a while longer).
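If Cloudflare does start refusing requests, a simple hedge is to back off and retry; a sketch assuming the requests library, with the status codes and delays being guesses rather than documented forum behaviour:

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Fetch a page, backing off exponentially when the server signals rate limiting."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code not in (429, 503):  # not a rate-limit / overload response
            return response
        time.sleep(delay)
        delay *= 2  # exponential backoff before the next attempt
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")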
Welsh | Staff, Legendary | Activity: 3262 | Merit: 4110
December 11, 2019, 02:25:33 PM | #8

Depending on the content you're scraping, it might be better to contact third parties that are already collecting the data. I think between them they've got most data covered. Some of these might also offer an API instead of requiring scraping.
LoyceV | Legendary | Activity: 3304 | Merit: 16609 | Thick-Skinned Gang Leader and Golden Feather 2021
December 11, 2019, 03:30:12 PM | #9

Quote from: anaconda46
I am doing some research.
Do you mind sharing what you're researching?

anaconda46 (OP) | Newbie | Activity: 5 | Merit: 0
December 11, 2019, 03:30:54 PM | #10

Quote from: DdmrDdmr
There aren’t (yet). Cloudflare seems to handle things pretty well, giving you an error if you try to load pages at a rate faster than 1 page per second. <…>

You can run this code to keep your script running 24/7 on a server:

def scrape_function():
    try:
        some_code_to_scrape()  # placeholder for the actual scraping logic
    except Exception:
        pass  # ignore errors so the loop keeps running

while True:
    scrape_function()

Quote from: Welsh
Depending on the content you're scraping, it might be better to contact third parties that are already collecting the data. <…>
Where can I find these APIs? What data do they cover?
Welsh | Staff, Legendary | Activity: 3262 | Merit: 4110
December 11, 2019, 03:37:53 PM | #11

Quote from: anaconda46
Where can I find these APIs? What data do they cover?
Depends on the data that you require. I'm sure if it's a decent project, one of the data gurus here will be able to help you out by developing an API. I don't think any of them have an API yet, but if they're concerned about scraping becoming an issue with a lot of users doing it, that might be the route to take.
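To illustrate that suggestion, a very rough sketch of what a read-only API over already-collected data could look like; Flask, the database layout, and the endpoint are all hypothetical choices, not anything that exists on the forum or at the archives mentioned above:

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)
# Hypothetical database of already-scraped posts (see the PostgreSQL sketch above).
conn = psycopg2.connect(dbname="forum_research", user="scraper",
                        password="secret", host="localhost")

@app.route("/posts/<int:topic_id>")
def posts_for_topic(topic_id):
    """Return the stored posts for one topic as JSON, so others don't need to scrape."""
    with conn, conn.cursor() as cur:
        cur.execute("SELECT post_id, author, body FROM posts WHERE topic_id = %s",
                    (topic_id,))
        rows = cur.fetchall()
    return jsonify({"posts": [{"post_id": r[0], "author": r[1], "body": r[2]} for r in rows]})

if __name__ == "__main__":
    app.run(port=8000)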
Initscri | Hero Member | Activity: 1540 | Merit: 759
December 12, 2019, 01:00:16 AM | #12

Quote from: anaconda46
You can run this code to keep your script running 24/7 on a server: <…>

Be careful with that code, btw. If you use it, you're going to want to put a delay in the loop to ensure you're not unintentionally DoSing the site.

anaconda46 (OP) | Newbie | Activity: 5 | Merit: 0
December 12, 2019, 05:20:17 AM | #13

Quote from: Initscri
Be careful with that code, btw. If you use it, you're going to want to put a delay in the loop to ensure you're not unintentionally DoSing the site.
Yes, I use:
time.sleep(1)
The 1 passed to the function tells it to pause for one second before moving on. This means I access a page slightly less often than once per second, because my VPS also needs time to send data to my Postgres server and fetch the HTML.
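Putting the thread's advice together, a minimal sketch of that loop with the one-second delay; the scrape step is a placeholder for whatever the script actually fetches and stores:

import time

def scrape_one_page():
    # Placeholder: fetch one forum page and save it to the database.
    ...

while True:
    try:
        scrape_one_page()
    except Exception as exc:
        print(f"scrape failed: {exc}")  # log the error instead of silently ignoring it
    time.sleep(1)  # stay at or below one page per second, as discussed above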