Bitcoin Forum
December 24, 2025, 10:20:38 PM *
Pages: « 1 2 3 [4] 5 »  All
Author Topic: Bitcointalk Search Project  (Read 1514 times)
LoyceV
Legendary
*
Offline Offline

Activity: 3906
Merit: 20747


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
October 17, 2024, 09:54:27 AM
 #61

Suggestion: use my download (from years ago), and keep the posts that have been changed or removed. It would be nice if you could make it optional to search only the most recent version, or also older versions of all posts.
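
To make the "latest only or all versions" option concrete, here is a minimal sketch. The `PostVersion` type and `selectVersions` function are hypothetical names invented for illustration; nothing in this thread says the project stores posts this way.

```typescript
// Hypothetical record for one scraped version of a post.
type PostVersion = {
  postId: number;    // forum message ID
  scrapedAt: string; // UTC timestamp in ISO 8601 form, e.g. "2024-10-17T09:54:27Z"
  content: string;   // post text as seen at scrape time
  deleted: boolean;  // true if the post was gone at scrape time
};

// Returns the versions a search should consider.
// latestOnly = true keeps only the newest version of each post.
function selectVersions(versions: PostVersion[], latestOnly: boolean): PostVersion[] {
  if (!latestOnly) return versions.slice();
  const newest = new Map<number, PostVersion>();
  for (const v of versions) {
    const seen = newest.get(v.postId);
    // Same-format UTC ISO 8601 strings sort chronologically as plain strings.
    if (seen === undefined || v.scrapedAt > seen.scrapedAt) {
      newest.set(v.postId, v);
    }
  }
  return Array.from(newest.values());
}
```

Keeping every version as its own row means edited and removed posts are never overwritten, and the "latest only" view is just a filter on top.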

¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
seoincorporation
Legendary
*
Offline Offline

Activity: 3584
Merit: 3311


View Profile
October 17, 2024, 01:40:13 PM
 #62

It has been months since you started this project, mate, and after reading the thread I ask myself about the approach... Do we really need to download the full forum to have a good search tool?

I don't think so. We can use online search engines with the right operators to find the right thread. Let me show you how:

GOOGLE:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

YAHOO:

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

DuckDuckGo

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Yandex

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"

Bing

Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"


If we know how to use the search engines, we should be able to find anything... Let me share some operators for your searches:

Quote
Quotation Marks    Used to search for an exact phrase or sequence of words
Minus Sign (-)    Excludes specific words or phrases from search results
Asterisk (*)    Acts as a wildcard to represent any word or phrase in a search
Double Dots (..)    Used for number range searches
Site:    Restricts search results to a specific site or domain
Define:    Provides definitions of terms
Filetype:    Filters results by specific file type
Related:    Displays sites similar to the specified web page
Cache:    Shows the cached version of a web page
Link:    Finds pages that link to the specified URL
Inurl:    Searches for terms in the URL of web pages
Allinurl:    Searches for all terms in the URL of web pages
Intitle:    Searches for terms in the title of web pages
Allintitle:    Searches for all terms in the title of web pages
Intext:    Searches for terms in the body of web pages
Time:    Shows current time in various locations
Weather:    Shows weather conditions and forecasts for a location
Stocks:    Shows stock information
Info:    Displays some information that Google has about a web page
Book:    Find information about books
Phonebook:    Finds phone numbers
Movie:    Find information about movies
Area code:    Searches for the area code of a location
Currency:    Converts one currency to another
~    Used to include synonyms or similar terms in a search
AROUND(X)    Searches for words within X words of each other
City1 City2    Searches for pages containing both cities
Author:    Searches for content by a specific author
Source:    Finds news articles from a specific source
Map:    Shows maps related to the search query
Daterange:    Searches within a specific date range
Safesearch:    Filters out explicit content from search results
Music:    Find music information
Patent:    Searches for patents
Clinical trials:    Finds information on clinical trials

LoyceV
Legendary
*
Offline Offline

Activity: 3906
Merit: 20747


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
October 17, 2024, 02:31:43 PM
Last edit: October 17, 2024, 06:07:50 PM by LoyceV
Merited by ABCbits (1)
 #63

GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.
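
A fixed delay like the one suggested above can be sketched as follows. This is only an illustration: `schedule`, `createThrottle` and `throttled` are hypothetical names, and the actual scraper in this thread may work quite differently.

```typescript
// Pure scheduling step: given the current time and the earliest allowed start,
// returns how long to wait and the new earliest allowed start.
function schedule(
  now: number,
  nextAllowed: number,
  minIntervalMs: number,
): { waitMs: number; nextAllowed: number } {
  const startAt = Math.max(now, nextAllowed);
  return { waitMs: startAt - now, nextAllowed: startAt + minIntervalMs };
}

// Returns a function to await before each request, so that requests start
// at least `minIntervalMs` apart (1000 ms for a one-second delay).
function createThrottle(minIntervalMs: number): () => Promise<void> {
  let nextAllowed = 0;
  return async () => {
    const slot = schedule(Date.now(), nextAllowed, minIntervalMs);
    nextAllowed = slot.nextAllowed; // reserve the slot before sleeping
    if (slot.waitMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, slot.waitMs));
    }
  };
}

// Runs any request function behind the throttle.
async function throttled<T>(throttle: () => Promise<void>, task: () => Promise<T>): Promise<T> {
  await throttle();
  return task();
}
```

Reserving the slot before sleeping keeps the spacing correct even if several requests are queued at once.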

¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
seoincorporation
Legendary
*
Offline Offline

Activity: 3584
Merit: 3311


View Profile
October 21, 2024, 01:44:53 PM
 #64

GOOGLE:
Code:
site:bitcointalk.org intitle:"x330,000" "seoincorporation"
That worked very well when Google still had their "don't be evil" approach. Nowadays, Google only shows what they want you to see: Google partial blackholing of Bitcointalk?

Well, that explains why it didn't work the first time I made the search: it didn't show any results, but after I looked up the same terms in other browsers, Google decided to show the search result. That's why I included the other search engines; maybe those are the right tools for searching the forum.

We could write a tool that calls those engines' APIs and runs the same search on all of them, returning the forum links. That would be a cool tool.
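
As a rough sketch of the first step of such a tool, the snippet below only builds the query string and one results-page URL per engine. The URL patterns are the engines' ordinary search pages as commonly seen in a browser, not official APIs, and should be treated as assumptions; `buildQuery`, `ENGINES` and `buildSearchUrls` are hypothetical names.

```typescript
// Builds the operator query used earlier in this thread:
// site:, intitle:"...", and an exact phrase.
function buildQuery(site: string, titlePhrase: string, exactPhrase: string): string {
  return `site:${site} intitle:"${titlePhrase}" "${exactPhrase}"`;
}

type Engine = { name: string; searchUrl: (query: string) => string };

// Assumed results-page URL patterns (not official search APIs).
const ENGINES: Engine[] = [
  { name: "google", searchUrl: (q) => `https://www.google.com/search?q=${encodeURIComponent(q)}` },
  { name: "yahoo", searchUrl: (q) => `https://search.yahoo.com/search?p=${encodeURIComponent(q)}` },
  { name: "duckduckgo", searchUrl: (q) => `https://duckduckgo.com/?q=${encodeURIComponent(q)}` },
  { name: "yandex", searchUrl: (q) => `https://yandex.com/search/?text=${encodeURIComponent(q)}` },
  { name: "bing", searchUrl: (q) => `https://www.bing.com/search?q=${encodeURIComponent(q)}` },
];

// Returns one search URL per engine for the same query.
function buildSearchUrls(query: string): { engine: string; url: string }[] {
  return ENGINES.map((e) => ({ engine: e.name, url: e.searchUrl(query) }));
}
```

Fetching and parsing the result pages is left out on purpose, since that part depends on each engine's terms and bot protection.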
NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
October 31, 2024, 08:13:40 AM
 #65

See, that's the thing - there isn't actually a delay built-in to the scraper, but it appears that there is a long waiting time sometimes when I make any request to Bitcointalk.
After re-reading this: are you sure it's Cloudflare's doing? Try setting a 1 second delay, because the rate-limiting was in place long before the forum used Cloudflare.

I have no idea, to be honest. But I don't want to try to diagnose this, as debugging this kind of performance issue tends to be frustrating and the problems are rarely reproducible.

Well, that explains why it didn't work the first time I made the search: it didn't show any results, but after I looked up the same terms in other browsers, Google decided to show the search result. That's why I included the other search engines; maybe those are the right tools for searching the forum.

We could write a tool that calls those engines' APIs and runs the same search on all of them, returning the forum links. That would be a cool tool.

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

betpanda.io | ANONYMOUS & INSTANT ONLINE CASINO | SLOT GAMES, SPORTS, LIVE CASINO | Regional Sponsor of the Argentina National Team
Vod
Legendary
*
Offline Offline

Activity: 4284
Merit: 3373


Licking my boob since 1970


View Profile WWW
October 31, 2024, 08:35:04 AM
 #66

Except that Google has even stronger anti-bot protections than Cloudflare - they make you solve the impossible reCaptcha if you search from a VPN address. Bypassing this with manual human workers a la 2captcha costs precious BTC.

AWS will give you a free EC2 instance to run your parser on.   Personally, I run two parsers on two instances (Europe and America) that access a central database controlling the frequency.  Costs me about $5/month.

If you don't need to log in, you could use your AWS $300 credit to run thirty parsers, each hitting the forum once per second. Do that for a month, then scale back to the free tier to stay up to date.

:)

NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
November 25, 2024, 03:51:05 AM
Last edit: November 25, 2024, 07:43:52 AM by NotATether
 #67

My scraper managed to crawl half of the Wall Observer topic before finally being defeated by Cloudflare. It ingested the pages in a bit over a day.

Still, it's an impressive achievement, considering this is the longest topic on the website by a wide margin.

I'm going to add the capability of checking for new messages on a thread sooner or later.



Welp, it looks like my IP was rate-limited immediately after, according to the logs. Time to spin up a new server :)

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

BlackHatCoiner
Legendary
*
Offline Offline

Activity: 1890
Merit: 9195


Bitcoin is ontological repair


View Profile
November 25, 2024, 09:28:10 AM
 #68

Welp, it looks like my IP was rate-limited immediately after, according to the logs. Time to spin up a new server :)
Do you mean that your IP is blocked by Cloudflare? If so, why don't you use a VPN?



LoyceV
Legendary
*
Offline Offline

Activity: 3906
Merit: 20747


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
November 25, 2024, 10:05:29 AM
 #69

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.
I checked the 3 providers where I have accounts (RamNode, HostVDS and RackNerd), and none of them have Arch Linux as a default choice. If you don't mind me asking: why Arch Linux? Would uploading your own ISO be an option? I've never tried it, but RamNode supports it.

¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
Vod
Legendary
*
Offline Offline

Activity: 4284
Merit: 3373


Licking my boob since 1970


View Profile WWW
November 27, 2024, 01:57:52 AM
 #70

Anyone know any providers that easily let me install Arch Linux? Or failing that, bare metal servers.

https://aws.amazon.com/about-aws/whats-new/2023/10/new-amazon-ec2-bare-metal-instances/

Again, AWS gives $300 - $1300 to new accounts.

Vod
Legendary
*
Offline Offline

Activity: 4284
Merit: 3373


Licking my boob since 1970


View Profile WWW
November 30, 2024, 06:15:54 AM
 #71

OP - have you developed a way to logically parse through each post and assign quotes to the proper person?   Because it's open input, people can modify it to anything, so a well organized system is necessary.    It's bugging the heck out of me, been working on it for two days now.  :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share. :D

NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
December 01, 2024, 01:41:25 PM
 #72

OP - have you developed a way to logically parse through each post and assign quotes to the proper person?   Because it's open input, people can modify it to anything, so a well organized system is necessary.    It's bugging the heck out of me, been working on it for two days now.  :/

I'm asking because I know your attention is focused on another project atm, so hopefully you'd share. :D

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.

TryNinja
Legendary
*
Offline Offline

Activity: 3430
Merit: 9455


Quickly check the BTC price: bitlist.co/converter


View Profile WWW
December 01, 2024, 02:27:58 PM
Merited by LoyceV (4), vapourminer (1)
 #73

Currently I'm stripping out the quotes. Also code blocks. Maybe I will make a second run to extract those and assign them as post metadata somehow.

I'll have to figure out a way to handle nested quotes. And many quotes in the same reply.
A second run seems like a waste of time and resources.

There are surely some untested cases, but I've been getting good results with this:

Code:
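// Assumed import (omitted in the original post): `load` parses an HTML string.
// The `cheerio.Cheerio` type used below assumes typings that expose a global
// `cheerio` namespace, e.g. @types/cheerio.
import { load } from 'cheerio';

type PostContent = {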
  raw_content: string;
  content: string;
  quoted_users: string[];
  quotes: string[];
};

function extractPostContent(html: string): PostContent {
  const $ = load(html);

  const result: PostContent = {
    raw_content: html,
    content: '',
    quoted_users: [],
    quotes: []
  };

  function extractTextContent(element: cheerio.Cheerio): string {
    return element
      .clone()
      .children('br')
      .each((_, el) => {
        $(el).replaceWith(' ');
      })
      .end()
      .children('.quoteheader')
      .each((_, el) => {
        if ($(el).children('a').length > 0) {
          $(el.next).remove();
        }
        $(el).text(' ');
      })
      .end()
      .text()
      .trim();
  }

  function processQuote(element: cheerio.Cheerio) {
    const quoteHeader = element.prev('.quoteheader');
    if (quoteHeader.length) {
      const userMatch = quoteHeader.text().match(/Quote from: (.+?) on/);
      if (userMatch) {
        result.quoted_users.push(userMatch[1]);
      }
    }

    const quoteContent = extractTextContent(element);
    if (quoteContent) {
      result.quotes.push(quoteContent);
    }

    element.find('> .quote').each((_, nestedQuote) => {
      processQuote($(nestedQuote));
    });
  }

  $('.quote').each((_, quote) => {
    if ($(quote).parent().hasClass('quote') || $(quote).prev('.quoteheader').children('a').length === 0) return;
    processQuote($(quote));
  });

  result.content = extractTextContent($('body'));

  $('.quoteheader').each((_, element) => {
    if ($(element).children('a').length > 0) {
      const elementText = $(element.next).text();
      result.content = result.content.replace(elementText, '');
    }
  });

  result.content = result.content.trim();
  result.quoted_users = [...new Set(result.quoted_users)];

  return result;
}

Of course, broken quotes can't be handled easily. Maybe AI could do it, but that is out of my league.

Vod
Legendary
*
Offline Offline

Activity: 4284
Merit: 3373


Licking my boob since 1970


View Profile WWW
December 01, 2024, 08:58:30 PM
 #74

Of course, broken quotes can't be handled easily. Maybe AI could do it, but that is out of my league.

I wrote one of the first PPC search engines after GoTo/Google (and sold it to a US corp in 2000), and it was tough back then just to determine the actual keywords. This idea is a lot more difficult due to multiple sources in the same post and all the modern language sets. I'm giving up on building any kind of search engine on my data, other than basic SQL queries. I'll use my data for something interesting and useful that has not been done yet. :)

NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
December 20, 2024, 06:31:05 AM
 #75

I have a new VPS now, but I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.

I don't have that implemented yet, but it wouldn't hurt to try. I don't know if that's going to be fast or not, but it's just one file per second and both of the servers have gigabit lines.

The scraper hasn't been running for a while, so now might be a good time for me to try it.

LoyceV
Legendary
*
Offline Offline

Activity: 3906
Merit: 20747


Thick-Skinned Gang Leader and Golden Feather 2021


View Profile WWW
December 20, 2024, 06:35:14 AM
 #76

I'd like to make it export results to my old VPS's filesystem, using something like SSHFS.
Before my dedicated server disappeared, I used to mount its directories on another server through sshfs. That worked fine and didn't disconnect as long as both servers were running.

Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
December 20, 2024, 06:53:41 AM
 #77

Quote
I don't know if that's going to be fast or not
Alternatively, you could just run a cronjob with rsync on one of the servers, regularly fetching all updates.

I like this option. No changes to my script required and only a small administrative addition to the server.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
December 29, 2024, 06:24:00 PM
 #78

I managed to parallelize the scraper, with some quirks that I'm in the process of fixing. This new scraper is more reliable than the old one and doesn't crash as often. It may even be faster too, but I haven't run any benchmarks yet.

NotATether (OP)
Legendary
*
Offline Offline

Activity: 2184
Merit: 9179


Trêvoid █ No KYC-AML Crypto Swaps


View Profile WWW
January 04, 2025, 11:21:02 AM
 #79

Using Zendriver, we can scrape two or three pages per second, without any Selenium-enforced delay.

I implemented an exponential backoff for when the forum server returns 503 errors, starting from one second and multiplying by two for each attempt.
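
For readers who want to see the idea, here is a minimal sketch of that kind of backoff. It is not the scraper's actual code: `fetchPage`, `maxAttempts`, the injectable `sleep` and the 60-second cap are all assumptions added for illustration.

```typescript
// Delay before retry number `attempt` (0-based): 1 s, 2 s, 4 s, ...
// The cap (maxMs) is an assumption; the post above does not mention one.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

type FetchResult = { status: number; body: string };

// Retries while the server answers 503, doubling the wait each time.
// `fetchPage` is a hypothetical function that performs one request.
async function fetchWithBackoff(
  fetchPage: () => Promise<FetchResult>,
  maxAttempts: number = 6,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<FetchResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchPage();
    if (result.status !== 503) return result;
    // Wait only if another attempt will follow.
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  throw new Error(`still getting 503 after ${maxAttempts} attempts`);
}
```

With the defaults, the waits between the six attempts are 1, 2, 4, 8 and 16 seconds.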

With these changes, the scraper has become about 10x faster, which makes the mission of organizing the forum's posts more feasible.

Once most of the posts are saved, the server can be queried at much longer intervals.

https://streamable.com/npfr6r

bitrr.io
Copper Member
Newbie
*
Offline Offline

Activity: 20
Merit: 2


View Profile
January 31, 2025, 02:44:02 PM
 #80

I am trying to make a search engine for Bitcointalk posts, since Google and the built-in one are so bad.

List all the features you want in a search engine here.

For now, I am scraping topics from the forum using my bot. I made sure to identify the requests as coming from me in my program so that the admins know where this traffic is coming from.

It doesn't look like it's exceeding the threshold of one request per second so that's good.

Private boards are not being scraped. The scraping is being done as a guest.

That sounds like an awesome project! A dedicated Bitcointalk search engine would be a huge improvement over Google and the built-in search. Here are some features that would make it really useful.

1. Better Search Filters
Search by User – Find posts by a specific username.
Date Range – Filter results by specific timeframes.
Board/Category Search – Only show results from selected sections like Altcoins, Services, or Mining.
Keyword Relevance – Prioritize results based on keyword frequency and context.
Post Type Filtering – Option to search only for thread titles or include replies.

2. Faster and Smarter Results
Real-time indexing so new posts show up quickly.
Cached results for faster searches without hitting the server hard.

3. Better Post Previews
Show snippets of posts before clicking.
Preserve BBCode formatting so posts don’t look broken.

4. Sorting by Popularity & Engagement
Sort by most replies, most views, or most merit to find valuable discussions faster.

5. Mobile-Friendly & Dark Mode
Clean, simple design that works well on mobile.
Dark mode support for easier reading at night.

6. API for External Use
A public API so devs can integrate search results elsewhere.

7. Save & Bookmark Searches

Ability to save frequent searches and access them later.
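
The filters from item 1 could look roughly like this once posts are indexed. This is only a sketch: `IndexedPost`, `SearchFilters` and `matchesFilters` are hypothetical names, and the fields are guesses at what a scraped post might carry.

```typescript
// Hypothetical indexed post.
type IndexedPost = {
  author: string;
  board: string;
  postedAt: string;      // UTC timestamp in ISO 8601 form
  isTopicStart: boolean; // true for the first post of a topic
  text: string;
};

// Hypothetical filter set: every field is optional.
type SearchFilters = {
  author?: string;            // search by user
  board?: string;             // board/category search
  from?: string;              // date range start (inclusive)
  to?: string;                // date range end (inclusive)
  topicStartersOnly?: boolean; // skip replies
};

// Returns true if the post passes every filter that is set.
function matchesFilters(post: IndexedPost, filters: SearchFilters): boolean {
  if (filters.author !== undefined && post.author.toLowerCase() !== filters.author.toLowerCase()) return false;
  if (filters.board !== undefined && post.board !== filters.board) return false;
  // Same-format UTC ISO 8601 strings compare chronologically as strings.
  if (filters.from !== undefined && post.postedAt < filters.from) return false;
  if (filters.to !== undefined && post.postedAt > filters.to) return false;
  if (filters.topicStartersOnly === true && !post.isTopicStart) return false;
  return true;
}
```

Keyword relevance and sorting would sit on top of this, applied only to the posts that pass the filters.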

It’s great that you're being mindful of forum limits and not scraping private boards. Keeping traffic within reasonable limits should help avoid any admin issues. Definitely excited to see how this evolves. Keep us posted! 🚀