For us to move forward meaningfully, we'll need either a full data dump or at least a large enough representative sample to train the model effectively.
I assume some of you already have a local copy or could potentially request access from the forum admins. If that is something anyone here can assist with, it would move things forward meaningfully (since we currently don't have the time to scrape the forum again from scratch).
Also, to clarify the scope: Should we focus just on topic titles and topic content, or include full threads + replies, discussions, and user context? The latter enables a much more powerful, context-aware search experience, but increases training requirements and cost.
Once we've got that clarified, we can prepare a working demo or prototype to showcase what the engine can do.
I've bolded the part that helps explain your disadvantage here. You are new to the forum and did not read the discussions we've had over the last month. There are a few forum post sources available (already discussed) and many users have already commented on what they want to see in the results.
I'm going to launch an API for retrieving Bitcointalk posts within the next couple weeks, so stay tuned.
Cool - if you mean querying posts, it's always good to have multiple search options. I'd like this thread to stay in the spirit of an "AI assisted" search, as there are already many topics on SQL queries and APIs. Right now both you and Ninja are working on APIs and may believe the first one to finish will be the winner, which I feel is false. The winner will not be the engine itself, but the various projects that use its results. The project I am working on will let a user select their engine, so please guys, take your time and do it right.

A proper AI search engine might cost close to a thousand dollars to build correctly (or 0.000013 BTC out of the 1000 BTC donated for projects like this), so there is no reason any user should have to pay. The largest obstacle will be Theymos anyway - imagine what people would uncover with a Deepseek interface trained on this forum.
