Bitcoin Forum

Other => Meta => Topic started by: theymos on March 16, 2018, 04:13:52 AM



Title: Additional data dumps?
Post by: theymos on March 16, 2018, 04:13:52 AM
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.


Title: Re: Additional data dumps?
Post by: SFR10 on March 16, 2018, 05:00:58 AM
What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
This (the rest aren't that important). I hope accounts with 0 posts/activity are excluded (to avoid a massive file full of information that's not needed).

Can we get another weekly dump that tracks positive/negative ratings (e.g. who sent each rating and to whom), and that also shows ratings someone has removed? (Credit goes to Vod (https://bitcointalk.org/index.php?action=profile;u=30747), based on this thread (https://bitcointalk.org/index.php?topic=2841315.0)).


Title: Re: Additional data dumps?
Post by: MadZ on March 16, 2018, 05:11:17 AM
It might be helpful to have a continuous version of the seclog without having to rely on archived pages.


Title: Re: Additional data dumps?
Post by: botany on March 16, 2018, 05:33:12 AM
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

Modlog definitely.


Title: Re: Additional data dumps?
Post by: Quickseller on March 17, 2018, 04:41:21 AM
I might suggest dumping the post history of individual users/accounts. This could be restricted by rank and otherwise be rate limited. I think it would be difficult to recreate any meaningful mirror site with this information.

As others have mentioned, the security log would be beneficial. The mod log, not so much because of its limited information.

It would be helpful if users' outboxes (and other folders) could be downloaded, since they cannot be easily searched. Obviously, downloading this information would be restricted to users who are logged into their own account.


Title: Re: Additional data dumps?
Post by: MyIota on March 17, 2018, 05:17:39 AM
Just fyi,

You can see and gauge how much sMerit someone has simply because the system is transparent. So that's effectively a hidden field you can derive from the data dump.

You can calculate how much they've received versus how much they've sent... and from there you'll know how much sMerit they have left :/
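
A rough sketch of that calculation, assuming whitespace-separated merit.txt rows of the form `time amount msg user_from user_to`, and assuming the rule that every 2 merit received yields 1 sMerit. Initial sMerit and merit-source allowances are not in the dump, so the result is only a lower bound; a negative value just means the user had sMerit from one of those unseen sources. The sample rows and the function name are made up for illustration.

```python
from collections import defaultdict

# Made-up sample rows in the assumed layout: time amount msg user_from user_to
SAMPLE = """\
1516831941 4 100.msg1 10 20
1516832000 1 101.msg2 20 30
1516832100 2 102.msg3 30 20
"""

def estimate_smerit(lines):
    """Return {uid: estimated sMerit left}, ignoring any initial allocation."""
    received = defaultdict(int)
    sent = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip a header line or malformed rows
        amount, user_from, user_to = int(parts[1]), int(parts[3]), int(parts[4])
        sent[user_from] += amount
        received[user_to] += amount
    users = set(received) | set(sent)
    # Assumed rule: 1 sMerit per 2 merit received, minus everything sent.
    return {u: received[u] // 2 - sent[u] for u in users}

balances = estimate_smerit(SAMPLE.splitlines())
```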


Title: Re: Additional data dumps?
Post by: LoyceV on March 17, 2018, 09:01:20 AM
UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?


Title: Re: Additional data dumps?
Post by: sncc on March 17, 2018, 02:37:35 PM
It seems to me that some local boards do not have sufficient sMerit distribution, and it would be good to clarify that directly from a data dump, which would help in designing an appropriate distribution of merit sources. It would be useful to have

post ID, topic ID, board ID, merit

and check how active each local board is and whether sufficient sMerit is distributed. Of course spam and low-quality posts will be counted, but I assume they are roughly proportional to the total number of posts.
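
A minimal sketch of that check, assuming a hypothetical dump with one `(post ID, topic ID, board ID, merit)` row per post. The rows below are invented, and `merit_per_board` is just an illustrative name.

```python
from collections import defaultdict

# Hypothetical rows: (post_id, topic_id, board_id, merit). Values are made up.
ROWS = [
    (1, 10, 1, 0),
    (2, 10, 1, 3),
    (3, 11, 1, 1),
    (4, 20, 2, 0),
    (5, 20, 2, 0),
]

def merit_per_board(rows):
    """Return {board_id: (post_count, total_merit, merit_per_post)}."""
    posts = defaultdict(int)
    merit = defaultdict(int)
    for _post_id, _topic_id, board_id, m in rows:
        posts[board_id] += 1
        merit[board_id] += m
    return {b: (posts[b], merit[b], merit[b] / posts[b]) for b in posts}

stats = merit_per_board(ROWS)
```

Boards with many posts but a merit-per-post figure near zero would be the candidates for additional merit sources.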


Title: Re: Additional data dumps?
Post by: DdmrDdmr on March 17, 2018, 05:01:52 PM
It all really comes down to what needs to be found out. That is, building a set of questions that need to be answered and deriving the raw data that enables an aggregated or derived dataset to be queried for the answers.

Some questions are answerable from a snapshot of the data, whilst others require a timeframe and timestamps to resolve.

For example, in order to see how long it takes members to rank up, we would need the whole history of rank changes per UserId, <UserId, Rank, Activity, Date>, where a record would only need to be created when a user is created or when the Rank changes, with Date being the associated timestamp.
If we wanted to see this in relation to Merit, we would need to build a registry in the shape of <UserId, Rank, Activity, Merit, InitialMerit, Date>.

The other key factor is related to the current way in which data is stored. The raw data layout and capture process is part of the process to reach our solution goal.
For example, if there is a trigger in the database that currently logs changes on the User table in the <UserId, Rank, Activity, Date> record structure, the underlying table can be used directly, and all that has to be done, once exported, is to select records that relate to a change in the user’s rank (and ignore those that are a mere activity change).

If, alas, the underlying user table does not hold a historical record of changes (i.e. no timestamped history is logged), then the question of how long it takes to rank up would either not be answerable or would need to be cross-referenced with raw data from another table.
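
As a sketch of how the rank-up question could be answered if such a history existed, assume hypothetical rows of `(UserId, Rank, Activity, Date)` with one row per user creation or rank change. The data and the function name are invented for illustration.

```python
from datetime import date

# Hypothetical history rows: (user_id, rank, activity, date), one per rank change.
HISTORY = [
    (1, "Newbie", 0, date(2018, 1, 1)),
    (1, "Jr. Member", 30, date(2018, 3, 2)),
    (1, "Member", 60, date(2018, 5, 1)),
    (2, "Newbie", 0, date(2018, 2, 1)),
]

def days_per_rank(history):
    """Return {(user_id, rank): days spent in that rank before the next change}."""
    by_user = {}
    for user_id, rank, _activity, when in sorted(history, key=lambda r: (r[0], r[3])):
        by_user.setdefault(user_id, []).append((rank, when))
    result = {}
    for user_id, changes in by_user.items():
        # Pair each record with the following one for the same user.
        for (rank, start), (_next_rank, end) in zip(changes, changes[1:]):
            result[(user_id, rank)] = (end - start).days
    return result

durations = days_per_rank(HISTORY)
```

Users who have not ranked up yet (user 2 here) produce no entry, so averages would only cover completed rank-ups.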

Questions that I would boldly put on the list, given the introduction of sMerit, would be:

- What is the average time per Rank to rank up, before and after the introduction of the Merit system? (This is not entirely comparable yet, since the merit system is only a few months old, so top Ranks cannot be compared yet.)

- How much sMerit is assigned per rank (from/to), per forum section, per forum subsection, in relation to the number of posts in the topic, in relation to how hot the topic is, in relation to post position in the topic (quartiles, for example), in relation to the size of the merited post, etc.?

- How much sMerit is being withheld and for how long (averages).

- Round merit assignment candidates (from User A to User B and back). That is derivable from the current merit.txt file, as I’ve posted previously; it is not necessarily cheating, but such cases are worth studying.

The match between a closed set of key questions to answer and the potential raw data structure should tell us which additional files are required, in my opinion.
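
The round merit question above can be sketched as follows, assuming whitespace-separated merit.txt rows of the form `time amount msg user_from user_to`. The sample rows are made up and `reciprocal_pairs` is an illustrative name, not an existing tool.

```python
from collections import defaultdict

# Made-up sample rows in the assumed layout: time amount msg user_from user_to
SAMPLE = """\
1516831941 1 100.msg1 10 20
1516832000 2 101.msg2 20 10
1516832100 1 102.msg3 10 30
1516832200 5 103.msg4 40 30
"""

def reciprocal_pairs(lines):
    """Return {(a, b): (merit a->b, merit b->a)} for pairs that sent both ways (a < b)."""
    totals = defaultdict(int)  # (user_from, user_to) -> total merit sent
    for line in lines:
        parts = line.split()
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip a header line or malformed rows
        totals[(int(parts[3]), int(parts[4]))] += int(parts[1])
    pairs = {}
    for (a, b), amount in totals.items():
        # Report each unordered pair once, and only if merit flowed back.
        if a < b and (b, a) in totals:
            pairs[(a, b)] = (amount, totals[(b, a)])
    return pairs

pairs = reciprocal_pairs(SAMPLE.splitlines())
```

A hit only flags a pair for a closer look; mutual merit between two good posters would show up the same way.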


Title: Re: Additional data dumps?
Post by: suchmoon on March 17, 2018, 09:30:36 PM
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.

All of the above, plus

Starting merit, starting sMerit, activity, rank for each user

This should allow us to see who's doing well (or not) at sending merits. Ideally we would also want merit source info but you didn't seem to want to publish that.

UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?

Usernames should be double-quoted then, and double quotes should be doubled inside double quotes... Yes, CSV format sucks but there is an RFC document for it (RFC 4180) and most modern tools should be able to handle that.
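
For illustration, here is how Python's standard `csv` module applies that quoting on write and undoes it on read. The usernames are made up.

```python
import csv
import io

# Made-up usernames containing a comma and double quotes.
rows = [(1, "plain"), (2, "Smith, John"), (3, 'the "real" one')]

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect quotes a field only when needed
writer.writerow(["uid", "name"])
writer.writerows(rows)
text = buf.getvalue()

# Reading the text back recovers the original names, comma and quotes included.
parsed = list(csv.reader(io.StringIO(text)))
```

Splitting lines on "," by hand is what breaks on such names; a real CSV parser does not.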


Title: Re: Additional data dumps?
Post by: 1020kingz on March 18, 2018, 01:58:26 AM
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:

 UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name

I'm not going to dump post contents in any form, since that would both be a massive file and it'd make things very easy for those annoying phishing mirror sites.
I think the UID -> name, merit, potential activity, posts dump is useful. With this you could easily compile a user's post contents and create an outbox for each user to hold them, making it easy to look up or search a user's activity and recent posts. Some other useful ideas have been suggested too, like this:
UID -> name, merit, potential activity, posts
I can think of a few:
1. Add "Activity" (not just "potential")
2. Add a banned-status to this list (ignore temporary bans)
3. Add either "merit earned" or "merit received for free at introduction"

Side note: there are more than 200 usernames with a comma, this will make processing a CSV difficult. Can you make this a file with just UID and name?
You can also monitor the give and take of merits by each user. This is what I understand from this thread; please feel free to correct me if I'm wrong.


Title: Re: Additional data dumps?
Post by: mobilazy on March 18, 2018, 08:45:12 AM
I hope user zentdex will come up with some beautiful and informative charts. I'd love to see his posts.

Meanwhile, I will try to come up with something decent myself. That will be the perfect way to study data analysis.


Title: Re: Additional data dumps?
Post by: JeremyB on March 18, 2018, 10:25:06 AM
I was asking for such information here (https://bitcointalk.org/index.php?topic=3151741.0) today and only just saw this thread.

I think all dumps related to forum architecture would be great for computing local board stats.

I am especially interested in analyses of this data which could point to sub-communities where the initial sMerit is exhausted and new sources are necessary, and people who might be good merit sources.

These kinds of requests would be easier to implement.

And what about some automatic dump archiving, so that several people don't each have to do the same?


Title: Re: Additional data dumps?
Post by: esmanthra on March 18, 2018, 12:17:08 PM
Recently someone asked about their account, which was hacked in December, and I didn't even have a way to look up the date it happened (since it's gone from the page). So a security log dump would indeed be helpful.


Title: Re: Additional data dumps?
Post by: Joel_Jantsen on March 18, 2018, 04:44:00 PM
Currently there are two big data dumps available which auto-update weekly, trust.txt.xz and merit.txt.xz. These auto-updating dumps are pretty easy to set up, so I was thinking that it might be a good idea to produce several more of these, perhaps in the end forming a "ghetto API". What dumps would be most useful? Some that I was thinking of were:
How big an operation would it be to auto-update the data on a daily basis? I was thinking I could set up endpoints which download the file daily and keep the source for my charts (whatever I choose to represent) updated every day.

It would be great if you could send a status flag along with the account details, like "active/inactive/banned".


Title: Re: Additional data dumps?
Post by: TheBeardedBaby on March 19, 2018, 10:28:34 AM
For me it would be useful to get data on all the user IDs posting in a specific topic, with times, e.g. in the ANN section.
If we can get the UID and time for each post in a topic, I can easily check for ICO pumpers.


Title: Re: Additional data dumps?
Post by: DdmrDdmr on March 24, 2018, 07:11:54 PM
Is there a possibility of including the Rank in the merit.txt file, or having another file to complement it, so as to perform rank analysis tied to the data in the merit.txt file?
I've seen Zentdex managed to cross-reference this information, but it's not in the public raw data files for general usage as far as I can see.

It's true that Rank will vary for some users within the timeframe of the data in the merit.txt file, but it would be a helpful source to break down the data and comprehend it better.


Title: Re: Additional data dumps?
Post by: Jet Cash on March 24, 2018, 07:31:22 PM
Deleted posts that have been awarded merit.


Title: Re: Additional data dumps?
Post by: LoyceV on April 16, 2018, 11:50:34 AM
Any follow up on this?


Title: Re: Additional data dumps?
Post by: mobilazy on April 16, 2018, 04:21:36 PM
I wish it were in CSV format, as that's the easiest one to work with. I'd love to practice the Seaborn skills I learned from a short Udemy course.



Title: Re: Additional data dumps?
Post by: TheBeardedBaby on February 08, 2019, 08:37:25 AM
It's not a necrobump.
Can we have modlog and seclog dumps instead of everyone scraping the data from the server?


Title: Re: Additional data dumps?
Post by: 100bitcoin on February 08, 2019, 12:15:11 PM
It's not a necrobump.
Can we have modlog and seclog dumps instead of everyone scraping the data from the server?


Is one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.


Title: Re: Additional data dumps?
Post by: TheBeardedBaby on February 08, 2019, 12:23:34 PM
Is one still able to scrape data from the server? I thought theymos prohibited it since bitcointalk.to started to scrape the whole forum.

Both LoyceV and Vod are doing it, and I've seen other users doing it too, so I think there isn't any prohibition yet.
I think if those dumps were available for download directly from the forum, more people could benefit from them and there would be less traffic to the server.


Title: Re: Additional data dumps?
Post by: LoyceV on February 08, 2019, 12:27:18 PM
Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k (https://bitcointalk.org/index.php?topic=5092983.0) banned users now, but it's far from complete.


Title: Re: Additional data dumps?
Post by: TheBeardedBaby on February 08, 2019, 12:44:52 PM
Scraping works, but the current modlog covers only a limited time. It's not possible to get a complete overview of all banned users. From various sources, I have a list of 170k (https://bitcointalk.org/index.php?topic=5092983.0) banned users now, but it's far from complete.

Is there any place where the modlog is available in raw format? Even for the limited time. I want to check some things :)


Title: Re: Additional data dumps?
Post by: LoyceV on February 08, 2019, 05:02:00 PM
Is there any place where the modlog is available in raw format? Even for the limited time. I want to check some things :)
Do you mean older versions? I used archive.li and archive.org.


Title: Re: Additional data dumps?
Post by: tranthidung on January 09, 2020, 03:48:01 AM
UID -> name, merit, potential activity, posts
 post ID -> topic ID, time, UID
 topic ID -> board ID, first post ID
 board ID -> board name
It's the new year, so I'm bumping this to ask theymos for the additional data dumps.

Besides the formats above, I'd like to ask for this one (for merit data):
Code:
time amount msg user_from user_to boardid
1516831941  1 2818066.msg28853325 35 877396 24
I already collected the boardid, so if the merit data had just one additional variable for the board's ID (boardid), it would eliminate the need to scrape data (with LoyceV's help) every 6 months. I don't know whether others need such a variable in the data dumps, though.

This would help with analyses like these:
  • Time series plots of daily merits over boards (https://bitcointalk.org/index.php?topic=5211736.0)
  • Merit distributions over boards (https://bitcointalk.org/index.php?topic=5211269.0)
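
A minimal sketch of how the proposed format could feed such analyses, using the header and sample line from the Code block above. It assumes the `msg` field is `<topic ID>.msg<post ID>` and that days are counted in UTC; `parse_row` and `daily_merit_by_board` are illustrative names.

```python
from collections import Counter
from datetime import datetime, timezone

HEADER = "time amount msg user_from user_to boardid"
LINE = "1516831941  1 2818066.msg28853325 35 877396 24"

def parse_row(line):
    """Turn one whitespace-separated merit row into a typed dict."""
    row = dict(zip(HEADER.split(), line.split()))  # split() tolerates runs of spaces
    # Assumption: the msg field is "<topic id>.msg<post id>".
    topic, _, msg = row["msg"].partition(".msg")
    return {
        "time": datetime.fromtimestamp(int(row["time"]), tz=timezone.utc),
        "amount": int(row["amount"]),
        "topic_id": int(topic),
        "msg_id": int(msg),
        "user_from": int(row["user_from"]),
        "user_to": int(row["user_to"]),
        "boardid": int(row["boardid"]),
    }

def daily_merit_by_board(lines):
    """Return Counter mapping (UTC date string, boardid) to total merit."""
    totals = Counter()
    for line in lines:
        r = parse_row(line)
        totals[(r["time"].date().isoformat(), r["boardid"])] += r["amount"]
    return totals

row = parse_row(LINE)
daily = daily_merit_by_board([LINE])
```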


Title: Re: Additional data dumps?
Post by: tranthidung on January 11, 2020, 03:36:40 AM
From the admin's reply (https://bitcointalk.org/index.php?topic=5216451.msg53569598#msg53569598) yesterday, I think now is a very good time to think about a consistent format for the forum's data dumps. Each dataset has different variables inside, but I think all of them should be connectable through at least one common variable: userid.

Username, whether it is the username, the display name, or both, will result in mismatches when connecting different datasets dumped by the forum.

Additional data dumps are not the priority, and I am not in a position to ask for too much, but for the current data formats a small adjustment, from username to userid, would be good.

LoyceV asked for this change too: https://bitcointalk.org/index.php?topic=5104467.msg53551686#msg53551686
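
To illustrate why userid makes a safer join key than a name, here is a small sketch with made-up records in which one user's name differs between two dumps. None of the field names or values come from the real dump files.

```python
# Made-up records: a user list, plus two versions of a merit table.
users = [
    {"userid": 1, "name": "alice"},
    {"userid": 2, "name": "bob_new"},  # hypothetical user whose name changed
]
merit_by_name = {"alice": 12, "bob": 7}  # other dump still carries the old name
merit_by_userid = {1: 12, 2: 7}

def join(users, table, key):
    """Attach table values to users by the given key; None when there is no match."""
    return {u["userid"]: table.get(u[key]) for u in users}

by_name = join(users, merit_by_name, "name")        # user 2 is silently lost
by_userid = join(users, merit_by_userid, "userid")  # every user matches
```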