Bitcoin Forum

Other => Meta => Topic started by: ltcltcltc on December 23, 2023, 04:28:02 PM



Title: Downloadable topic-database?
Post by: ltcltcltc on December 23, 2023, 04:28:02 PM
I know LoyceV has put together a nice scrapable archive (http://loyce.club/archive/) of the topics of this forum.

I want to do an analysis of the BTT forum. I could write a script to scrape the data from LoyceV's archive, but I was really hoping someone could point me towards a fully downloadable database to speed things up. Does anyone have a reference?

Cheers!


Title: Re: Downloadable topic-database?
Post by: LoyceV on December 23, 2023, 05:23:27 PM
How about you just ask nicely? ;) What do you need, and what's the goal? Or better: will you publish the results on Bitcointalk?


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 23, 2023, 07:44:27 PM
Haha I didn't think about that indeed.
I came across this (https://www.researchgate.net/publication/326427024_Sentiment_Analysis_To_Predict_Global_Cryptocurrency_Trends) sentiment analysis of BTT. It aims to infer a correlation between the temperature/feeling of this forum and the price trend of cryptos like Bitcoin. I found it interesting so I thought I'd try it myself, play around with the data, see what comes out. Might even be an intro to ML. My goal: learning. Oftentimes that leads to interesting results but one can never be certain. Still, if anything at all comes out of this you'll be the first to read about it.


Title: Re: Downloadable topic-database?
Post by: LoyceV on December 23, 2023, 08:11:53 PM
I've seen another sentiment analysis, which analysed posts concerning the block size discussion at Fork time. But it's only in my email, I can't find it online.

Have you thought about how you'd handle my data? It's a lot: millions of files, about 100 GB, and most file systems can't handle that many files without many subdirectories.
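
A minimal sketch of one way to keep millions of per-post files manageable: spread them over subdirectories named after the leading digits of the post ID. The helper names (`shard_path`, `store_post`), the `archive` root and the four-digit prefix are assumptions for illustration, not a description of how the archive is actually built.

```python
from pathlib import Path


def shard_path(root: Path, msg_id: int, prefix_len: int = 4) -> Path:
    """Return root/<leading digits of msg_id>/<msg_id>.html.

    Assumed layout: the first `prefix_len` digits of the ID name the
    subdirectory, so no single directory has to hold every file.
    """
    name = str(msg_id)
    return root / name[:prefix_len] / f"{name}.html"


def store_post(root: Path, msg_id: int, html: str) -> Path:
    """Write one post's raw HTML into its shard, creating the shard if needed."""
    path = shard_path(root, msg_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


if __name__ == "__main__":
    print(shard_path(Path("archive"), 63378236))
```

Shorter IDs simply end up in shorter-named shards, so the buckets are uneven; a fixed-width, zero-padded ID would even them out.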


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 23, 2023, 09:30:35 PM
How is your data classified? Tree-structure or raw recent-first stack? In the second case, I'd probably reorganize it myself into a tree-like structure. That way it should be quicker to filter the data. Maybe start with the Bitcoin Discussion board, then scale up.

Also, I can always chop those 100 GB into several time series. Perhaps the 2020-2022 period contains juicier data than the rest (due to the rise and drop of BTC). Everything can be explored.
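
A rough sketch of slicing messages by time period, assuming each message is already parsed into a dict with a Unix-time `time` field (that field name and the helpers `unix` and `in_window` are hypothetical):

```python
from datetime import datetime, timezone


def unix(year: int, month: int = 1, day: int = 1) -> int:
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def in_window(messages, start: int, end: int):
    """Yield the messages whose 'time' falls in the half-open range [start, end)."""
    for msg in messages:
        if start <= msg["time"] < end:
            yield msg


if __name__ == "__main__":
    sample = [{"msgId": "1", "time": 1500000000},
              {"msgId": "2", "time": 1600000000}]
    # Keep only messages posted from 2020 through 2022.
    print(list(in_window(sample, unix(2020), unix(2023))))
```

Using a half-open range means consecutive windows (2020-2022, 2023-...) never count the same message twice.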


Title: Re: Downloadable topic-database?
Post by: LoyceV on December 23, 2023, 10:10:24 PM
Quote
How is your data classified?
See the link you started this topic with. WYSIWYG. Best I can do is a post number.


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 23, 2023, 11:26:11 PM
Thanks, btw what do you mean by WYSIWYG? I get what it stands for, and that it's CS slang, but how does it apply here? Also, I've seen that your website offers the functionality of showing any given user's messages. Is there an analogous way of filtering messages by board? Like: "showing Economy messages".

PS. I've seen your other work, quite impressive!


Title: Re: Downloadable topic-database?
Post by: Vod on December 23, 2023, 11:41:09 PM
Quote
Thanks, btw what do you mean by WYSIWYG?

Odd that you have never googled that phrase, but you've found an obscure website. Could it have something to do with my recent suggestion (https://bitcointalk.org/index.php?topic=5469588.msg63359307#msg63359307)? ;)


Title: Re: Downloadable topic-database?
Post by: LoyceV on December 24, 2023, 07:55:32 AM
Quote
Thanks, btw what do you mean by WYSIWYG? I get what it stands for, and that it's CS slang, but how does it apply here?
I don't know what "CS slang" is, but WYSIWYG stands for What You See Is What You Get. That's literally how my data files are.

Quote
Also, I've seen that your website offers the functionality of showing any given user's messages. Is there an analogous way of filtering messages by board? Like: "showing Economy messages".
Nope. That's TryNinja's specialty (https://bitcointalk.org/index.php?topic=5273824). Again: just ask nicely :)


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 24, 2023, 08:37:08 AM
CS means computer science.

Quote
That's literally how my data files are.

So, please tell me if I'm wrong: you just scrape content with limited data treatment, so the filtering functionalities you offer are the same that the forum offers, i.e. sorting by chronological order, viewing the posts inside a topic and filtering by user.

Quote
Again: just ask nicely

Ok! I thought by the previous message that you were declining. But if that's not the case then I'd be super grateful if you shared your database with me to avoid the undesirable task of rescraping the scraped!

Ninja's website looks handy too. Harder to scrape though.


Title: Re: Downloadable topic-database?
Post by: ABCbits on December 24, 2023, 08:55:41 AM
Quote
Also, I've seen that your website offers the functionality of showing any given user's messages. Is there an analogous way of filtering messages by board? Like: "showing Economy messages".
Nope. That's TryNinja's specialty (https://bitcointalk.org/index.php?topic=5273824). Again: just ask nicely :)

The link you mentioned shows that TryNinja already offers an API, documented at https://docs.ninjastic.space/. If OP is willing to write a script that downloads topics/replies from the API and wait several days, it should be a viable option.
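
A minimal sketch of the "download slowly over several days" idea: a loop that pauses between requests. It deliberately does not name any real endpoint; `fetch` is a hypothetical callable you would write yourself from the API documentation, and `download_all` and `delay_s` are illustrative names.

```python
import time
from typing import Callable, Dict, Iterable


def download_all(ids: Iterable[int],
                 fetch: Callable[[int], dict],
                 delay_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[int, dict]:
    """Call fetch(item_id) for every ID, pausing delay_s seconds between calls.

    `fetch` is supplied by the caller (hypothetical: e.g. a function that
    requests one topic from the API and returns the parsed response).
    """
    results: Dict[int, dict] = {}
    for index, item_id in enumerate(ids):
        if index:  # no pause needed before the very first request
            sleep(delay_s)
        results[item_id] = fetch(item_id)
    return results


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without any network access.
    print(download_all([1, 2, 3], fetch=lambda i: {"id": i}, delay_s=0.1))
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.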


Title: Re: Downloadable topic-database?
Post by: LoyceV on December 24, 2023, 08:59:02 AM
Quote
So, please tell me if I'm wrong: you just scrape content with limited data treatment, so the filtering functionalities you offer are the same that the forum offers, i.e. sorting by chronological order, viewing the posts inside a topic and filtering by user.
I don't process anything, I just keep the raw HTML (https://loyce.club/archive/posts/6337/63378236.html) for archiving purposes. Although I also keep a list per user and per topic.

Quote
Ok! I thought by the previous message that you were declining.
I meant ask TryNinja nicely if you want for instance only data from the Economics board.

Quote
But if that's not the case then I'd be super grateful if you shared your database with me to avoid the undesirable task of rescraping the scraped!
I don't have a database. I just have "data". And it's a lot. Hence my question if you know how you're going to handle it. Old posts (https://loyce.club/archive/oldposts/0/0xx.html#msg36) for instance are stored in a different format, although I may still have a backup of individual files for each post. Update: found it. That's the part where you'll get millions of files in one directory. So you'll have to be a bit more specific before I just dump a shitload of files on you :P

Quote
Ninja's website looks handy too. Harder to scrape though.
Don't scrape, ask :P


Title: Re: Downloadable topic-database?
Post by: TryNinja on December 24, 2023, 09:23:11 AM
I’m willing to give anyone a .csv or similar with any data that I have. Like Loyce said, all you gotta do is ask. :)


Title: Re: Downloadable topic-database?
Post by: digaran on December 24, 2023, 10:34:34 AM
My man triple ltc, can we start over? The analysis you linked above seems interesting. Can you do a special analysis on price changes and my appearances on the Reputation and Meta boards in the past 6 months? Lol, I mean, is there a way to do that?
Earlier I thought you were one of the trolls harassing me, so apologies for snapping at you. I appreciate the effort. 😉


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 24, 2023, 01:47:00 PM
Quote
I just keep the raw HTML for archiving purposes.
Ok. Then perhaps TryNinja's database fits my purposes better.

Quote
I’m willing to give anyone a .csv or similar with any data that I have.
I think a tree-like structure (board/subboard/topic/message) would work best so as to study conversations as a whole more than individual messages, since I don't care about individual opinions as much as I do about global sentiments.
So maybe JSON? Does this work for you? I mentioned the Economy board as an example; ideally I'd want the whole data.
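
For what it's worth, a flat export such as a .csv can be regrouped into that kind of tree after the fact. A minimal sketch, assuming one row per message with hypothetical column names `board`, `topicId`, `msgId` and `time`:

```python
from collections import defaultdict


def build_tree(rows):
    """Group flat message rows into {board: {topicId: [messages, oldest first]}}.

    The keys 'board', 'topicId' and 'time' are assumed column names,
    not the layout of any existing export.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["board"]][row["topicId"]].append(row)
    # Convert to plain dicts and sort each conversation chronologically.
    return {
        board: {
            topic_id: sorted(msgs, key=lambda m: m["time"])
            for topic_id, msgs in topics.items()
        }
        for board, topics in grouped.items()
    }


if __name__ == "__main__":
    rows = [
        {"board": "Economy", "topicId": 1, "msgId": 11, "time": 200},
        {"board": "Economy", "topicId": 1, "msgId": 10, "time": 100},
    ]
    print(build_tree(rows))
```

Rows read with `csv.DictReader` arrive as strings, so `time` would need converting to a number before sorting.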

It would be a super favour you'd be doing me.


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 24, 2023, 03:54:27 PM
It's a good idea too. I'll .append() it to the list. Ltc stands for something other than Litecoin.


Title: Re: Downloadable topic-database?
Post by: TryNinja on December 24, 2023, 05:14:01 PM
Quote
I think a tree-like structure (board/subboard/topic/message) would work best so as to study conversations as a whole more than individual messages, since I don't care about individual opinions as much as I do about global sentiments.
So maybe JSON? Does this work for you? I mentioned the Economy board as an example; ideally I'd want the whole data.
JSON is probably fine. Could you provide an example of the format you want with a dummy post?


Title: Re: Downloadable topic-database?
Post by: ltcltcltc on December 25, 2023, 01:21:33 AM
Quote
JSON is probably fine. Could you provide an example of the format you want with a dummy post?

Suppose there are boards B1 and B2. B1 has child boards B11 and B12. Each board is represented as a folder with the same name. The main folder, F, could be structured as follows (every instance of content.txt represents a file; the rest are folders).

F
├───B1
│   ├───content.txt
│   ├───B11
│   │   └───content.txt (*)
│   └───B12
│       └───content.txt
└───B2
    └───content.txt
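
A small sketch of how such a folder tree could be read back, assuming every content.txt holds one valid JSON document describing its board. The function name `load_boards` is hypothetical.

```python
import json
from pathlib import Path


def load_boards(root):
    """Yield (board path relative to root, parsed content.txt) for every board.

    Assumes each content.txt is a JSON document; child boards are found
    by walking the folder tree recursively.
    """
    root = Path(root)
    for path in sorted(root.rglob("content.txt")):
        with path.open(encoding="utf-8") as handle:
            yield path.parent.relative_to(root).as_posix(), json.load(handle)


if __name__ == "__main__":
    # "F" is the main folder from the example; nothing is printed if it is absent.
    for board_path, board in load_boards("F"):
        print(board_path, len(board["topics"]))
```

With this layout a child board is identified by its relative path, e.g. `B1/B11`.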

Now here's what a content.txt file could look like. Suppose we're looking at (*).

{
    "name": "B11",
    "topics": [
        {
            "topicId": "1111",
            "subject": "Help me out plz",
            "op": {"userId": "3596085", "username": "ltcltcltc", "activity": 26, "merit": 60},
            "time": <timestamp of the original post>,
            "messages": [
                {
                    "msgId": "6666",
                    "author": {"userId": "3596085", "username": "ltcltcltc", "activity": 26, "merit": 60},
                    "time": <timestamp of this message (in this case the original post)>,
                    "merited": 2,
                    "message": "Hey does anyone know how to speed up ecdsa signature bruteforcing?"
                },
                {
                    "msgId": "6699",
                    "author": {"userId": "3597570", "username": "aleph1", "activity": 1, "merit": 23},
                    "time": <timestamp of this message>,
                    "merited": 0,
                    "message": "Stop wasting your time."
                }
            ]
        },
        {
            "topicId": "2222",
            "subject": "Test. Do not answer.",
            "op": {"userId": "3597570", "username": "aleph1", "activity": 1, "merit": 23},
            "time": <timestamp of the original post>,
            "messages": [
                {
                    "msgId": "8008",
                    "author": {"userId": "3597570", "username": "aleph1", "activity": 1, "merit": 23},
                    "time": <timestamp of this message (in this case the original post)>,
                    "merited": 3,
                    "message": "Testy test."
                }
            ]
        }
    ]
}


I didn't give any example of a timestamp because I don't know what your time format is, but I think I'd prefer Unix time. Also note the redundancy: the topic's timestamp is the same as the timestamp on the first message of said topic. The topics inside each board are ordered chronologically (oldest first), and so are the messages inside each topic.
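
The two rules above (the redundant topic timestamp and the oldest-first ordering) are easy to check mechanically. A minimal sketch, assuming a board has already been parsed into a dict shaped like the example and that `time` holds a number such as Unix time; `check_board` is a hypothetical helper name.

```python
def check_board(board: dict) -> list:
    """Return a list of human-readable problems found in one parsed board."""
    problems = []
    topic_times = [topic["time"] for topic in board["topics"]]
    if topic_times != sorted(topic_times):
        problems.append("topics are not in chronological order")
    for topic in board["topics"]:
        msg_times = [msg["time"] for msg in topic["messages"]]
        if not msg_times:
            problems.append(f"topic {topic['topicId']}: no messages")
            continue
        if msg_times != sorted(msg_times):
            problems.append(f"topic {topic['topicId']}: messages out of order")
        # The topic timestamp is redundant: it must equal the first message's.
        if topic["time"] != msg_times[0]:
            problems.append(f"topic {topic['topicId']}: time differs from first message")
    return problems


if __name__ == "__main__":
    board = {"name": "B11", "topics": [
        {"topicId": "1111", "time": 100,
         "messages": [{"msgId": "6666", "time": 100},
                      {"msgId": "6699", "time": 150}]}]}
    print(check_board(board))
```

An empty list means the board satisfies both rules.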

What do you think about this format?