The warm tier of nodes is slower than the hot/content tier when fetching data, but has about 4x more storage capacity.
Currently, v1 (talksearch_bitcointalk in the picture) search is running on warm nodes. It used to be on the hot nodes, where searches were quite fast, but in the process of fixing my cluster, it got moved to warm.
All v2 indices besides English are on the hot nodes. However, I'm not particularly satisfied with the amount of low-quality posts in the English index, so I'm considering moving only the high-quality English posts to the hot nodes. Then there would be a checkbox on the site that reads "Only search high-quality posts".
The issue is, I currently don't have a reliable way to measure post quality.
By the way, http:// does not currently work on Talksearch. Use https://. I am thinking about redirecting all traffic to the https:// version anyway.
That's a lot of data.
Is the system cataloging all the words and things like that? How are you processing all the information?
Yes!
In fact I am excited to show you the advanced classifications that are available for the data.
Elasticsearch has a number of field types for natively handling JSON, beyond just strings, numbers, and booleans. I am talking about things like rank features (common in search applications), vectors, points, geolocation types, dates, binary data, and much more:

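As a rough illustration of how such field types are declared, here is a sketch of an index mapping written as a Python dict (the same shape as the JSON body you would PUT to Elasticsearch). The field names are hypothetical, not the actual Talksearch schema:

```python
# Hypothetical mapping showing several of the field types mentioned above.
mapping = {
    "mappings": {
        "properties": {
            "body":       {"type": "text"},          # full-text, analyzed
            "username":   {"type": "keyword"},        # exact-match string
            "date":       {"type": "date"},           # date field
            "post_score": {"type": "rank_feature"},   # boosts relevance at query time
            "embedding":  {"type": "dense_vector",    # for semantic/vector search
                           "dims": 384},
            "location":   {"type": "geo_point"},      # geolocation
            "is_op":      {"type": "boolean"},
        }
    }
}
```

Each field's type controls how it is indexed and which queries can run against it; for example, a `rank_feature` field can only be used to influence scoring, not to store arbitrary text.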
These types are then used by Elasticsearch's query language, which is very advanced and can find far more relevant results than, e.g., a regex-based search. This is where the true power lies.
Here's a list of things that its query language is able to do:
- It can boost or penalize certain keywords
- It supports semantic search, e.g. "satoshi nakamoto identity" should return topics about Hal Finney or Nick Szabo
- It can apply autocorrect and "keyboard shift" corrections
- It can match text that appears only in a certain position in the post
- It can find "More like this" results
- You can attach scripts to create complex searches (though this is slow)
- Feature-based search is fully supported, so you can query posts by username, board, topic title, date, etc. in exactly the same manner as by post body. (This last part is sorely missing from Google Site Search.)
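To give a flavor of a few of these capabilities combined, here is a sketch of a single query that mixes full-text matching, a keyword boost, and exact filters on metadata fields, again as a Python dict in the shape of an Elasticsearch request body. Field names are hypothetical and this is not Talksearch's actual ranking algorithm:

```python
# Hypothetical query: full-text search with a boosted keyword,
# filtered by board and date range (feature-based search).
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"body": {"query": "satoshi nakamoto identity"}}}
            ],
            "should": [
                # Boost posts mentioning this phrase without requiring it.
                {"match": {"body": {"query": "Hal Finney", "boost": 2.0}}}
            ],
            "filter": [
                {"term": {"board": "Bitcoin Discussion"}},
                {"range": {"date": {"gte": "2010-01-01", "lte": "2014-12-31"}}}
            ]
        }
    }
}
```

The `filter` clauses don't affect scoring at all, while the `should` clause only nudges relevance, which is exactly the kind of control a regex-based search cannot offer.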
There's a lot more stuff I didn't list that you can find here. But the exact algorithm I use is proprietary.
I'm curious to see the posts per user sorted by your experimental algorithm. Would that be possible to search for?
That's a great idea, but with the current version of Talksearch this is not possible. It will eventually be available though.
Is 120 million the number of posts + edits?
No, 120 million is an estimate of the average number of chunks there would be if you split posts by quote body or by line-break [hr] tags. For example, this post I am writing would be indexed as 4 chunks, as they are separate pieces of information. This allows users to see results that are as specific as possible on the search results page, without having to navigate to the Bitcointalk post.
I assume there is, on average, a bit less than one quote or line break per post, which is why there are a bit fewer than 2 chunks per post on average (quotes etc. plus one).
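The chunking idea above can be sketched in a few lines. This is only illustrative (the real indexing pipeline isn't public): it isolates each [quote]...[/quote] block and splits on [hr] markers, so a post becomes one chunk per quote plus the surrounding text:

```python
import re

def split_into_chunks(post_body: str) -> list[str]:
    """Split a post into chunks on [hr] tags and [quote] blocks.

    A simplified sketch of the chunking described above, not the
    actual Talksearch implementation.
    """
    # Surround each quote block with [hr] markers so it becomes
    # its own chunk, then split everything on [hr].
    isolated = re.sub(r"(\[quote[^\]]*\].*?\[/quote\])", r"[hr]\1[hr]",
                      post_body, flags=re.DOTALL | re.IGNORECASE)
    chunks = [c.strip() for c in isolated.split("[hr]")]
    return [c for c in chunks if c]

post = "[quote author=satoshi]original text[/quote]My reply.[hr]A second thought."
chunks = split_into_chunks(post)
# → 3 chunks: the quoted text, the reply, and the text after [hr]
```

Counting chunks this way also matches the "quotes etc. plus one" estimate: a post with one quote and no [hr] tags yields two chunks.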
I only store the latest revision of each post.