NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
May 23, 2025, 11:33:51 AM |
|
Offtopic, I hope you appreciate getting more questions instead of more answers. I do believe asking the right questions is more helpful to start your research. I can't vouch for the quality of ai answers, just that it looked interesting. I'm not a programmer, but it does offer to write your code as well.
I appreciate it greatly. I have done some looking around over the past few days, and I found a machine learning model called BERT that was made by Google in 2018 for search engines. Can you believe that? An AI model from before AI models were a thing.
I do have sort of a background in machine learning models, so I can summarize it briefly here: Instead of vectorizing individual words, and thus relying on keywords to search, it vectorizes entire phrases, i.e. words that are adjacent to each other in a sentence. This makes natural language search possible (example: "block size wars" returning debates about segwit and bcash instead of only posts with "block size" in them). There are many improved versions of BERT nowadays, large ones and small ones. However, the models require dedicated hardware with GPUs to run.
The good news is, Elasticsearch makes it painfully easy to deploy a model. You literally just have to press the "Run" button next to it. And then search algorithms will be using the model automatically.
The bad news is, they don't come cheap. There is one ML node in my cluster, which I receive at no additional cost, but it only has 1GB of RAM and can't store any model, so it's pretty useless. Upgrading to the next hardware tier that has 2GB is going to bump the total monthly bill to around $300. And I am already hounded enough by Google with biweekly invoices. Therefore I want to wait until all the new post content is uploaded before I delete the old, incomplete post content, which will allow me to slash the storage size by about half. Then adding a larger ML node will make Talksearch's running costs somewhat lower than they are right now. It will be a wise investment, though. GPUs on dedicated servers are not plentiful, and are much more expensive than this.
Unfortunately, despite thousands and thousands of post chunks a day being uploaded, I am only about 10% of the way there. I can't experiment with BERT search until it's done.
And pray my server doesn't run out of memory mid-upload, because my disk being the primary bottleneck means that retries will not be faster. But move to an SSD or something and the Elasticsearch nodes get overwhelmed with requests and run out of memory themselves. I imagine this whole process becomes much faster with even larger hardware, but that is not an amount I'm willing to spend, especially on a beta product. Good thing there is only one "initial block download" - after that I'll never have to worry about that again (unless catastrophic data loss occurs, as I'm only paying for one availability zone - ugh).
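To make the vector idea above a bit more concrete, here is a minimal sketch of ranking posts by cosine similarity between embeddings. The three-dimensional vectors are made up for illustration (a real BERT-style model would produce hundreds of dimensions), and `semantic_search` is a hypothetical helper, not anything Talksearch or Elasticsearch actually exposes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up embeddings, NOT the output of any real model.
POSTS = {
    "SegWit vs bigger blocks debate": [0.9, 0.1, 0.0],
    "Bcash fork announcement": [0.8, 0.3, 0.1],
    "Best casino bonuses": [0.0, 0.1, 0.9],
}


def semantic_search(query_vec, posts, top_k=2):
    """Return the top_k post titles whose vectors are closest to query_vec."""
    ranked = sorted(posts, key=lambda title: cosine(query_vec, posts[title]), reverse=True)
    return ranked[:top_k]


# Pretend this is the embedding of the query "block size wars".
query = [0.85, 0.2, 0.05]
print(semantic_search(query, POSTS))
```

Neither of the two top results needs to contain the words "block size"; they only need vectors that point in a similar direction to the query.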
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
May 26, 2025, 02:49:33 PM Last edit: May 27, 2025, 07:20:29 AM by NotATether |
|
It appears that there is a problem with making search queries again. I will investigate this.
Please do not delete this post.
Update: The problem has been identified. It appears that the access token has expired. I am currently deploying a fix and will update you when this is done.
Update 2: It has been fixed.
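The post does not say what the fix was. One common way to avoid this class of outage, sketched here under that assumption, is to cache the token and refresh it shortly before it expires. `fetch_token` is a hypothetical callable returning `(token, lifetime_seconds)`; the names and the 60-second margin are illustrative only.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=60, clock=time.time):
        self._fetch_token = fetch_token  # hypothetical: returns (token, lifetime_seconds)
        self._margin = margin_seconds    # refresh this many seconds before expiry
        self._clock = clock              # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, fetching a new one if the cached one is stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime = self._fetch_token()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

Every search request would then call `cache.get()` instead of reading a token that was stored once at deploy time.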
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
May 31, 2025, 08:34:00 AM |
|
Guys, I need some suggestions. I want to move the Elasticsearch server off of Google Cloud, due to AML problems I'm now facing when I attempt to load my card with $100 to pay the bills.
What are some hosting providers that *do not* use Coingate or Cryptomus?
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3710
Merit: 19111
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
May 31, 2025, 08:47:08 AM |
|
What are some hosting providers that *do not* use Coingate or Cryptomus?
I got my last VPS from Servarica, but can't remember which payment provider they used. I checked my email, and it doesn't show anything from any external provider. I can't really test it by making a payment now, maybe just ask them? This is the offer I took (8 slices Slim Plan + 2 TB SAN Storage).
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
May 31, 2025, 09:48:17 AM |
|
What are some hosting providers that *do not* use Coingate or Cryptomus?
I got my last VPS from Servarica, but can't remember which payment provider they used. I checked my email, and it doesn't show anything from any external provider. I can't really test it by making a payment now, maybe just ask them? This is the offer I took (8 slices Slim Plan + 2 TB SAN Storage).
It's enough to know whether they support Monero payments or not. If so, then there is no transaction screening on any coins, since XMR is untraceable anyway.
Looks like I'm going to be scouring LowEndTalk for a while. Some specs I'm looking for to make searching easier:
- 512GB SSD
- At least 16GB of memory - more is obviously good, I want indexing to be instantaneous this time, instead of taking months.
- A regular Intel/AMD processor will do (Apparently, I don't need a GPU. w00t!)
- 1 Gbps Ethernet
I'm fine with spending $100/month on this, but deals are obviously nice.
|
|
|
|
LoyceV
Legendary
Offline
Activity: 3710
Merit: 19111
Thick-Skinned Gang Leader and Golden Feather 2021
|
 |
May 31, 2025, 10:00:58 AM |
|
Looks like I'm going to be scouring LowEndTalk for a while.
Note: I've seen and paid good and bad providers, and I've been burned more than once. So be careful who you trust. I'm quite happy with Racknerd too:
- 512GB SSD
- At least 16GB of memory - more is obviously good, I want indexing to be instantaneous this time, instead of taking months.
- 1 Gbps Ethernet
I'm fine with spending $100/month on this, but deals are obviously nice.
At that price, you could get a Premium KVM or VDS with 4 CPU, 16 GB RAM and 400 NVMe at Ramnode Cloud. They're good, but expensive. I only use them when I need one for a short time: $0.15 per hour you use it.
|
¡uʍop ǝpᴉsdn pɐǝɥ ɹnoʎ ɥʇᴉʍ ʎuunɟ ʞool no⅄
|
|
|
psycodad
Legendary
Offline
Activity: 1731
Merit: 2195
精神分析的爸
|
 |
May 31, 2025, 11:35:29 AM |
|
Looks like I'm going to be scouring LowEndTalk for a while.
Note: I've seen and paid good and bad providers, and I've been burned more than once. So be careful who you trust. I'm quite happy with Racknerd too:
I can second that statement about Racknerd: I've been running a KVM VPS there for ~3 years without a single problem so far. But I concede that I am a few days short in the uptime-dick-swinging-contest (damn them friggin kernel update reboots..):
Though unfortunately Racknerd accepts some crypto but not Monero:
We accept the following payment methods:
- ALL major credit cards (AMEX, Discover, VISA, Master)
- PayPal
- Cryptocurrency (Bitcoin, Bitcoin Cash, Litecoin, Ethereum, USDT, USDC)
- Alipay/支付宝
- Wire
More payment methods are supported upon checking out.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
May 31, 2025, 12:03:57 PM Last edit: June 01, 2025, 06:25:46 PM by NotATether |
|
I've settled on this beauty from Dartnode:
Model: Dual Xeon E5-2650 v4
Specs
---
Base Price: Dual Xeon E5-2650 v4
OS Install: AlmaLinux 9.4
Memory: 32 GB DDR4
Drive #1: 500 GB NVMe (boot)
IP Addresses: 1x IPv4
Port Speed: 1 Gbps (Unmetered)
DDoS Protection: Enabled
It only costs me $100 a month, so it's a massive improvement from Google Cloud. The application itself will still be hosted there by the way, as it costs almost nothing to run. It's just the Elasticsearch server(s) being moved. It isn't actually usable yet, it is still in the setup phase.
Edit: For some reason, the DDoS protection is a $35 addon. Whatever. That's already been added. I only have about 3 or so days to set up the new server with Elasticsearch before I have to move money around again to the cards, so I have to do it fast as I'd like to avoid that.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 02, 2025, 09:41:56 AM Last edit: June 02, 2025, 10:48:41 AM by NotATether |
|
Ingestion has now started on the new Elasticsearch server, and compared to my old cluster it's going lightning fast. If all goes well, it should be finished in about a day or two, and then I will redirect the search queries towards it and then shut down the old cluster.
Edit: Wow, already over 200k posts indexed in just an hour! According to my calculations, about 5 million posts can be uploaded in a single day. Therefore it's going to take up to 2 weeks for everything to get in there, but guess what? No resource exhaustion this time, so no crashes. I still plan to shut off the old cluster ASAP.
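The estimate above can be checked with a few lines of arithmetic. Only the 200k posts/hour figure and the two-week horizon come from the post; the total number of posts is not stated, so the last line only shows the upper bound that those two numbers imply.

```python
posts_per_hour = 200_000                 # "over 200k posts indexed in just an hour"
posts_per_day = posts_per_hour * 24
print(posts_per_day)                     # 4800000, i.e. roughly 5 million per day


def days_to_index(total_posts, per_day):
    """Whole days needed to index total_posts at a steady per_day rate."""
    return -(-total_posts // per_day)    # ceiling division


# Upper bound implied by "up to 2 weeks" at this rate (not a figure from the post).
max_posts_in_two_weeks = posts_per_day * 14
print(max_posts_in_two_weeks)            # 67200000
```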
|
|
|
|
Vod
Legendary
Offline
Activity: 4102
Merit: 3274
Licking my boob since 1970
|
 |
June 02, 2025, 07:20:15 PM |
|
Ingestion has now started on the new Elasticsearch server
I looked into that for my new project - it allows you to search for a minimum of TWO characters instead of three. It's expensive though... Hopefully you'll let me add your engine to my extension so the user can choose.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 05, 2025, 06:38:50 AM Last edit: June 05, 2025, 07:22:00 AM by NotATether |
|
NOTICE
Planned maintenance has commenced on Talksearch. (It did not start exactly as planned, because of ongoing bullshit from my internet provider.) During this time, search queries will be redirected to the new cluster. This post will be updated periodically with the status as it progresses.
Update 06:57 UTC - migration has finished and the service is being brought back online.
Update 10:09 UTC - Talksearch service brought back online. Search traffic was moved to a new cluster. Posts may be missing while the index is filled over the next few days.
Update 10:13 UTC - the old Elasticsearch cluster on Google Cloud has been deleted. Maintenance has been completed.
I looked into that for my new project - it allows you to search for a minimum of TWO characters instead of three. It's expensive though...
It has a free, open source version, but it needs to run on very powerful hardware to be useful.
Hopefully you'll let me add your engine to my extension so the user can choose.
We can talk about that later. The immediate priority right now for me is to create significantly more powerful search parameters on the website.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 09, 2025, 10:52:31 AM |
|
Bump (Merit overload managed to push this thread all the way down to page 2)
The algorithm feels awful though - any suggestions on how I should improve it?
|
|
|
|
joker_josue
Legendary
Offline
Activity: 2058
Merit: 5834
**In BTC since 2013**
|
 |
June 09, 2025, 05:19:00 PM |
|
The algorithm feels awful though - any suggestions on how I should improve it?
What do you mean by awful? What do you think it's doing wrong for the proposed goals?
|
|
|
|
Ivystar5
Full Member
 
Offline
Activity: 273
Merit: 146
Stressed since 19's
|
 |
June 09, 2025, 06:45:56 PM |
|
The algorithm feels awful though - any suggestions on how I should improve it?
I was thinking we could get to an advanced stage where I can input a prompt like "what does Satoshi say about Bitcointalk administration?" and it will return threads where Satoshi talked about the administration of the forum, from which the user will be able to figure out the exact thread or discussion that he or she is searching for. More like an AI type of research response with links to several related threads. I did try to ask a question like this, but it only delivers threads whose titles contain each of the words. Why I want this is because sometimes, in an argument that requires you to provide links or threads where a user said something, it becomes difficult, as one has to search several times or even remember some statements that are in the thread.
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 10, 2025, 05:28:41 AM |
|
The algorithm feels awful though - any suggestions on how I should improve it?
What do you mean by awful? What do you think it's doing wrong for the proposed goals?
It prioritizes occurrences too much. So when you search "casino" for example, the top results are the ones that have written casino two or three times in the title. It makes it feel spammy, but I'm waiting until all the content is uploaded before I do anything about it. Fortunately, this time, it will only take a few more days.
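For context on why repeated words climb the ranking: Elasticsearch's default similarity is BM25 (k1 = 1.2, b = 0.75), and k1 controls how quickly extra occurrences stop adding score. The sketch below uses the classic textbook form of the term-frequency factor and assumes Talksearch runs on the default scoring, which the post does not confirm; `bm25_tf` is an illustrative helper, not an Elasticsearch API.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=5, avg_len=5):
    """Classic BM25 term-frequency factor for a term that occurs tf times.

    doc_len and avg_len are the field length and the average field length;
    the defaults model a title of average length.
    """
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)


# Compare one "casino" in the title against three, for a few k1 values.
for k1 in (1.2, 0.2, 0.0):
    once = bm25_tf(1, k1=k1)
    thrice = bm25_tf(3, k1=k1)
    print(f"k1={k1}: tf=1 -> {once:.3f}, tf=3 -> {thrice:.3f}")
```

With a smaller k1 the gap between one and three occurrences shrinks, and at k1 = 0 it disappears entirely, so a lower k1 on the title field is one knob to try once the content is uploaded.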
|
|
|
|
joker_josue
Legendary
Offline
Activity: 2058
Merit: 5834
**In BTC since 2013**
|
 |
June 10, 2025, 06:58:05 AM |
|
It prioritizes occurrences too much. So when you search "casino" for example, the top results are the ones that have written casino two or three times in the title.
It makes it feel spammy, but I'm waiting until all the content is uploaded before I do anything about it. Fortunately, this time, it will only take a few more days.
Well, that's the biggest challenge for search engines. It took Google years to create an algorithm that could minimize this situation. To help minimize this, you have to create more filter criteria. For example, in addition to looking at just the title, it has to look at the topic content. An example: the topic title has the word "casino" 3 times, but how many times does it appear in the OP? Throughout the topic, does the word "casino" appear more often or not at all? Is the term "casino" in a conversational context or in the context of a name of something? Applying the rules and ensuring a good balance is not easy. This will undoubtedly be the biggest challenge of the project.
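One way to picture the extra criteria described above is a score that looks at where the keyword appears rather than how often. This is only a sketch: `combined_score` and its weights are hypothetical, and each field counts at most once so that repeating the word in the title earns nothing extra.

```python
def combined_score(title_hits, op_hits, thread_hits,
                   w_title=2.0, w_op=1.0, w_thread=0.5):
    """Score a topic by which fields mention the keyword, ignoring repetition.

    title_hits, op_hits and thread_hits are raw occurrence counts in the
    title, the opening post, and the rest of the thread. Weights are made up.
    """
    return (w_title * (title_hits > 0)
            + w_op * (op_hits > 0)
            + w_thread * (thread_hits > 0))


# "casino" three times in the title, never mentioned in the thread itself.
spammy = combined_score(title_hits=3, op_hits=0, thread_hits=0)
# "casino" once in the title, and actually discussed in the OP and replies.
genuine = combined_score(title_hits=1, op_hits=4, thread_hits=10)
print(spammy, genuine)  # 2.0 3.5
```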
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 10, 2025, 03:42:17 PM |
|
Well, that's the biggest challenge for search engines. It took Google years to create an algorithm that could minimize this situation.
To help minimize this, you have to create more filter criteria. For example, in addition to looking at just the title, it has to look at the topic content.
An example: the topic title has the word "casino" 3 times, but how many times does it appear in the OP? Throughout the topic, does the word "casino" appear more often or not at all? Is the term "casino" in a conversational context or in the context of a name of something?
Applying the rules and ensuring a good balance is not easy. This will undoubtedly be the biggest challenge of the project.
It's not just spam, there are for some reason a ton of topics in search results that have been deleted on the forum. So they all have to be purged. Finding out which ones are deleted is going to be a challenge as it will require another forum scrape.
|
|
|
|
joker_josue
Legendary
Offline
Activity: 2058
Merit: 5834
**In BTC since 2013**
|
 |
June 10, 2025, 04:09:10 PM |
|
It's not just spam, there are for some reason a ton of topics in search results that have been deleted on the forum. So they all have to be purged.
Finding out which ones are deleted is going to be a challenge as it will require another forum scrape.
But what kind of scrape did you do that collected topics that have already been deleted? Did you use an old database? Maybe you can just run a script that checks whether a given topic still exists and, if it doesn't, deletes it from the DB. Or you may want to use this as a historical archive.
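A validation script along those lines could look roughly like this. How the forum signals a deleted topic is not covered in this thread, so `topic_exists` is a hypothetical callable supplied by the caller; the delay between checks is also an assumption, added to keep the scrape polite. The function returns the missing IDs instead of deleting anything, so they can either be purged or flagged as archive material.

```python
import time


def find_deleted(topic_ids, topic_exists, delay_seconds=1.0, sleep=time.sleep):
    """Return the topic IDs for which topic_exists(topic_id) is False.

    topic_exists is a hypothetical check against the live forum; sleep is
    injectable so the loop can be tested without real waiting.
    """
    deleted = []
    for i, topic_id in enumerate(topic_ids):
        if i:
            sleep(delay_seconds)  # pause between requests, not before the first
        if not topic_exists(topic_id):
            deleted.append(topic_id)
    return deleted
```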
|
|
|
|
NotATether (OP)
Legendary
Offline
Activity: 2002
Merit: 8606
Search? Try talksearch.io
|
 |
June 11, 2025, 08:28:44 AM |
|
But what kind of scrape did you do that collected topics that have already been deleted? Did you use an old database?
Maybe you can just run a script that checks whether a given topic still exists and, if it doesn't, deletes it from the DB. Or you may want to use this as a historical archive.
Most of the old posts came from Ninjastic.space.
While I figure out how to weed out the old posts, I've run some tests on Google Colaboratory with three different spam-detection LLMs (well, they are not specifically for spam detection except for the first one, but they can be used to classify text) on various categories.
https://pdflink.to/bert-tiny-finetuned-sms-spam-detection/
https://pdflink.to/distilbert-base-uncased-finetuned-sst-2-english/
https://pdflink.to/deberta-large-mnli/
I think they get the overall sentiment, especially the last one, but it would be unwise to rely only on an LLM as a universal quality score. Additional measures must be put in place to identify e.g. application posts, obviously AI-generated posts, and such in order to not return them in search results. I'm also going to place a minimum post length, to avoid indexing things like bumps.
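The minimum-length rule could be as simple as the sketch below. The 40-character threshold and the bump pattern are assumptions chosen for illustration, not values from the site, and `should_index` is a hypothetical name.

```python
import re

# Matches posts that consist of nothing but "bump" or "up" plus punctuation.
BUMP_PATTERN = re.compile(r"^\W*(bump|up)\W*$", re.IGNORECASE)


def should_index(post_text, min_chars=40):
    """Return True if a post is long enough and is not a bare bump."""
    text = " ".join(post_text.split())  # collapse whitespace before measuring
    if BUMP_PATTERN.match(text):
        return False
    return len(text) >= min_chars
```

A post that starts with "Bump" but goes on to say something of substance still gets indexed, since only bare bumps and very short posts are dropped.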
|
|
|
|
joker_josue
Legendary
Offline
Activity: 2058
Merit: 5834
**In BTC since 2013**
|
 |
June 11, 2025, 05:15:01 PM |
|
I'm also going to place a minimum post length, to avoid indexing things like bumps.
Have you ever thought about a post/topic author rating system? A higher-ranked user - more posts, merit, ranking - passes the filters. The rest have to go through tighter filters. This may help reduce the number of posts analyzed, and help filter better.
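As a rough sketch of that idea: trusted authors skip the classifier entirely, and everyone else must clear a threshold that gets stricter as merit drops. All cut-offs here are made-up numbers, and `score_post` stands in for whatever quality model ends up being used.

```python
def quality_threshold(merit, activity, trusted_merit=1000, trusted_activity=1000):
    """Return the minimum quality score required, or None to skip scoring.

    All thresholds are hypothetical and would need tuning on real data.
    """
    if merit >= trusted_merit and activity >= trusted_activity:
        return None
    return 0.5 if merit >= 100 else 0.8


def accept_post(merit, activity, score_post):
    """Decide whether to index a post; score_post is only called if needed."""
    threshold = quality_threshold(merit, activity)
    if threshold is None:
        return True  # trusted author: the classifier is never run
    return score_post() >= threshold
```

Because `score_post` is passed as a callable, posts from high-ranked authors never reach the model, which is where the saving in analyzed posts would come from.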
|
|
|
|
|