Bitcoin Forum
May 03, 2024, 11:00:08 PM *
Author Topic: What i discovered about webscraping Bitcointalk.org  (Read 141 times)
Zilon (OP)
Sr. Member
****
Offline Offline

Activity: 966
Merit: 421

Bitcoindata.science


View Profile WWW
April 03, 2023, 08:36:03 PM
Last edit: April 04, 2023, 07:34:12 AM by Zilon
Merited by DdmrDdmr (4), Halab (2), Sexylizzy2813 (1)
 #1

Hello mates, I tried doing some fun stuff with Python's BeautifulSoup library to scrape some information from the Bitcointalk URL and save it in a variable: for example, to see the anchor tags, or to find the users with the highest activity in the last 20 days. Unfortunately I got an error message. I tried the code on a few other sites and it worked well, but the forum gave me this error:



I tried the same code on a few other sites, such as analytics, and was able to get all the href attributes and anchor tags from them.

I did the same for Facebook and it worked, so I kept wondering why it didn't work for the Bitcointalk URL. I would be glad if someone could explain why I can't scrape information from the forum.
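
For readers following along, here is a minimal sketch of the "get all the href and anchor tags" step described above. It parses a small inline HTML string rather than a live page, so it runs without any network access; `SAMPLE_HTML` and `extract_links` are hypothetical names used only for illustration and are not the original poster's code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="https://bitcointalk.org/index.php?board=1.0">Bitcoin Discussion</a>
  <a href="/index.php?action=help">Help</a>
  <a name="top">No href here</a>
</body></html>
"""


def extract_links(html):
    """Return (href, link text) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute.
    return [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, "->", text)
```

The same function can be given the bytes or text of any page once the download itself succeeds.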

jackg
Copper Member
Legendary
*
Offline Offline

Activity: 2856
Merit: 3071


https://bit.ly/387FXHi lightning theory


View Profile
April 04, 2023, 03:03:37 AM
 #2

Where's your code? Are you doing any looping? Trying to load the website multiple times a second will result in an error. I'm not sure if there's something else too, as you haven't posted your code. Feel free to DM it if you don't want to post it publicly, but remove any login details first.

time.sleep(1000) would be enough to add to a loop to stop the error - the time is in milliseconds if you want to edit it.
Zilon (OP)
Sr. Member
****
Offline Offline

Activity: 966
Merit: 421

Bitcoindata.science


View Profile WWW
April 04, 2023, 06:30:02 AM
Last edit: April 04, 2023, 07:19:32 AM by Zilon
 #3

Quote from: jackg on April 04, 2023, 03:03:37 AM
Where's your code? Are you doing any looping? Trying to load the website multiple times a second will result in an error. I'm not sure if there's something else too, as you haven't posted your code. Feel free to DM it if you don't want to post it publicly, but remove any login details first.

time.sleep(1000) would be enough to add to a loop to stop the error - the time is in milliseconds if you want to edit it.

My code is in the image I posted, but I will type it out in case it is not visible.

Code:
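# notebook shell command; the PyPI package for bs4 is named beautifulsoup4
! pip install beautifulsoup4
import urllib.request
import re
from bs4 import BeautifulSoup
import time

time.sleep(1000)
# this plain urlopen call is the one that raises the error below
r =  urllib.request.urlopen('https://bitcointalk.org/index.php?').read()
soup = BeautifulSoup(r, 'html.parser')
type(soup)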
I added the time.sleep(1000), but the entire cell just went to sleep and then finally came back with the same error message:
Code:
HTTPError: HTTP Error 403: Forbidden
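
As a side note, a 403 raised by urllib can be caught and inspected instead of stopping the whole cell. The following is a minimal sketch of that idea; `fetch` and its `opener` parameter are hypothetical names introduced here (the parameter only exists so the function can be tried without touching the network), and nothing is requested until `fetch` is actually called.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return (status, body); on an HTTP error return (code, None) instead of raising."""
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (403 in this thread),
        # err.reason is the accompanying text ("Forbidden").
        print(f"Request to {url} failed: {err.code} {err.reason}")
        return err.code, None
```

Calling `fetch('https://bitcointalk.org/index.php?')` with the default opener would then report the status code rather than raising an exception.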
OmegaStarScream
Staff
Legendary
*
Offline Offline

Activity: 3472
Merit: 6115



View Profile
April 04, 2023, 07:40:22 AM
Merited by hosseinimr93 (4), DdmrDdmr (4), ABCbits (2), vapourminer (1)
 #4

There are two options here:

To use requests instead of urllib:

Code:
import requests
from bs4 import BeautifulSoup

# fetch the forum index page with requests and parse the returned HTML
r =  requests.get('https://bitcointalk.org/index.php')
soup = BeautifulSoup(r.content, 'html.parser')
print(soup)

Or add a user-agent to the request you're making:

Code:
import urllib.request
from bs4 import BeautifulSoup

# attach a browser-like User-Agent header to the request
r =  urllib.request.Request('https://bitcointalk.org/index.php?', headers={'User-Agent': 'Mozilla/5.0'})
response = urllib.request.urlopen(r)
soup = BeautifulSoup(response.read(), 'html.parser')
print(soup)
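
To see what the User-Agent change actually alters, the headers can be inspected without sending anything. This is only a sketch: it assumes the forum rejects urllib's default client string, which the thread suggests but does not confirm, and the variable names are illustrative.

```python
import urllib.request

# build_opener() only constructs an opener; no network request is made here.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # urllib identifies itself as "Python-urllib/<version>"

# A Request object carrying a browser-like User-Agent, as in the snippet above.
req = urllib.request.Request(
    "https://bitcointalk.org/index.php",
    headers={"User-Agent": "Mozilla/5.0"},
)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```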

Either way, make sure you're not sending requests too often[1]. You should use time.sleep, but note that the function takes seconds in Python, not milliseconds.

[1] https://bitcointalk.org/index.php?topic=953815.msg10442011#msg10442011
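
Putting the two points together (a User-Agent header plus a pause measured in seconds), a polite fetch loop might look like the sketch below. The one-second delay is an assumption taken from the figures mentioned in this thread, not a documented limit, and `fetch_page` and `fetch_all` are hypothetical helper names. The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access.

```python
import time
import urllib.request

HEADERS = {"User-Agent": "Mozilla/5.0"}  # same header as in the reply above


def fetch_page(url):
    """Download one page with a browser-like User-Agent."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def fetch_all(urls, delay_seconds=1.0, fetch=fetch_page, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # time.sleep takes seconds, so 1.0 means one second (not 1000).
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

No request is made until `fetch_all` is called with a list of URLs.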

NotATether
Legendary
*
Offline Offline

Activity: 1596
Merit: 6727


bitcoincleanup.com / bitmixlist.org


View Profile WWW
April 04, 2023, 07:58:23 AM
 #5

Try using the Requests library to read the data instead of urllib.

Although I no longer have the code sample to show you, my implementation of a post scraper using Requests worked magnificently well, with a timeout of 1 second.

You're probably running into issues with Cloudflare though, hence the 403. Maybe you should chain an anti-captcha browser or service to the library as well.
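
Since the original code sample is not available, the following is only a guess at what a Requests-based fetch with a one-second timeout could look like. It assumes "timeout" refers to the `timeout` argument of Requests (it may equally have meant a pause between requests); `make_session` and `fetch_html` are hypothetical names, and it does not attempt anything Cloudflare-related.

```python
import requests


def make_session(user_agent="Mozilla/5.0"):
    """Create a Session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout_seconds=1.0):
    """GET a page, giving up if the server takes longer than timeout_seconds."""
    response = session.get(url, timeout=timeout_seconds)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx such as 403
    return response.text
```

Nothing is downloaded until `fetch_html` is called.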

Zilon (OP)
Sr. Member
****
Offline Offline

Activity: 966
Merit: 421

Bitcoindata.science


View Profile WWW
April 04, 2023, 08:45:51 AM
 #6

....
Thank you, it solved the problem and worked just fine.
