Bitcoin Forum
Topic: Standalone Database with Merit data + Full user details (Trust included)  (Read 296 times)
Piggy (OP)
Hero Member | Activity: 784 | Merit: 1416
October 09, 2018, 07:30:25 AM
Merited by TMAN (5), suchmoon (4), Coin-1 (1), DdmrDdmr (1), TheBeardedBaby (1), vlad230 (1), mainconcept (1)
#1

Since there is interest in having all the available data on users in relation to the merit system, for all sorts of analysis and investigation, I'm releasing here a full version of the database used for Tool to run online SQL queries over Full Merit Data (which will continue to use a limited subset of user information, to keep that solution light and prevent its database from growing too much).

The data will be updated roughly every one or two weeks, since it covers 29k users at the moment and is bound to grow.

You can download the latest version of the full database (an SQLite file), together with a few scripts that will help you extract the data, here: https://drive.google.com/open?id=1l0QQJl2NcAVwbXz49sSK6X5D1xmXxHlw

Here is a brief description of the steps needed to use the database:

Download the precompiled binaries for your favourite platform here (we just need sqlite3):
https://www.sqlite.org/download.html
Quote
"A bundle of command-line tools for managing SQLite database files, including the command-line shell program, the sqldiff program, and the sqlite3_analyzer program."

If you want to know in more detail how to use sqlite3 directly: https://www.sqlite.org/cli.html

Example usage on Windows, from the command line:

Put the database, the scripts and sqlite3.exe in the same folder and you have everything you need to start:

Query.bat file:
Code:
sqlite3 btctalk_full.db < CommandsQuery.txt

CommandsQuery.txt:
Code:
.mode csv
.separator "\t"
.output ResultQuery.csv
SELECT * FROM UserData WHERE Activity > 1000;
.output stdout
.quit

Now call Query.bat from the command line and, after a bit of crunching, you will get the data in ResultQuery.csv, tab separated.
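If you'd rather post-process that export programmatically, here is a minimal Python sketch for loading the tab-separated file. The file name and sample rows below are illustrative stand-ins; with the real export you would pass "ResultQuery.csv".

```python
import csv

def load_results(path):
    """Load a tab-separated export into a list of rows (lists of strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

# Demo on a tiny stand-in file, not the real export.
with open("sample_results.tsv", "w", encoding="utf-8") as f:
    f.write("Piggy\t1416\nvlad230\t279\n")

rows = load_results("sample_results.tsv")
print(rows)  # [['Piggy', '1416'], ['vlad230', '279']]
```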

If you just need to update your existing data by adding trust, you can simply substitute the query in CommandsQuery.txt with:
Code:
SELECT Username,Trust FROM UserData;

Each user row records when its data was last read, in the LastUpdatedData column.

If you need help or some clarification let me know.
vlad230
Sr. Member | Activity: 616 | Merit: 279
October 09, 2018, 08:29:02 AM
#2

Thanks a lot for providing the database with extended user data Smiley I'll probably use it to create some elaborate queries Wink

May I ask what extra information you have added to the UserData table? I see Trust & Activity in your queries; anything else?

Also, does the UserData table only contain users that have touched merit at some point (i.e. sent or received merit)?
Piggy (OP)
Hero Member | Activity: 784 | Merit: 1416
October 09, 2018, 08:56:39 AM
#3

Quote from: vlad230 on October 09, 2018, 08:29:02 AM
Thanks a lot for providing the database with extended user data Smiley I'll probably use it to create some elaborate queries Wink

May I ask what extra information you have added to the UserData table? I see Trust & Activity in your queries; anything else?

Also, does the UserData table only contain users that have touched merit at some point (i.e. sent or received merit)?


UserData contains all the users that either sent or received any merit, and these are the columns of the table:

Code:
UserData(
UserId Integer PRIMARY KEY,
UserName TEXT,
Trust TEXT,
Merit Integer,
Posts Integer,
Activity Integer,
Rank TEXT,
DateRegistered Integer,
LastActive Integer,
ICQ TEXT,
AIM TEXT,
MSN TEXT,
YIM TEXT,
Email TEXT,
Website TEXT,
Bitcoinaddress TEXT,
Gender TEXT,
Age TEXT,
Location TEXT,
LastUpdatedData Integer
);
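For anyone who prefers querying from a script instead of the sqlite3 shell, here is a hedged sketch using Python's built-in sqlite3 module. It runs against an in-memory copy of the schema above with made-up sample rows; pointing connect() at btctalk_full.db would run the same query on the real data.

```python
import sqlite3

# In-memory copy of the UserData schema for illustration;
# use sqlite3.connect("btctalk_full.db") for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE UserData(
    UserId INTEGER PRIMARY KEY, UserName TEXT, Trust TEXT,
    Merit INTEGER, Posts INTEGER, Activity INTEGER, Rank TEXT,
    DateRegistered INTEGER, LastActive INTEGER, ICQ TEXT, AIM TEXT,
    MSN TEXT, YIM TEXT, Email TEXT, Website TEXT, Bitcoinaddress TEXT,
    Gender TEXT, Age TEXT, Location TEXT, LastUpdatedData INTEGER)""")

# Made-up rows, only to make the example self-contained.
conn.executemany(
    "INSERT INTO UserData(UserId, UserName, Merit, Rank) VALUES (?, ?, ?, ?)",
    [(1, "alice", 10, "Hero Member"),
     (2, "bob", 5, "Hero Member"),
     (3, "carol", 1, "Newbie")])

# Count merited users per rank.
per_rank = conn.execute(
    "SELECT Rank, COUNT(*) FROM UserData GROUP BY Rank ORDER BY Rank").fetchall()
print(per_rank)  # [('Hero Member', 2), ('Newbie', 1)]
```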
vlad230
Sr. Member | Activity: 616 | Merit: 279
October 09, 2018, 09:15:08 AM
#4

Quote from: Piggy on October 09, 2018, 08:56:39 AM
UserData contains all the users that either sent or received any merit, and these are the columns of the table:
[...]
Looking good! Smiley So you've added all the data from the profiles.
How are you getting this data? Are you crawling every profile, or using some SMF API that returns it in a friendlier format like XML or JSON? Feel free to send me a PM if you like.

Any chance the approach you're using could be applied to get the data for all of the users on BitcoinTalk?
I know this would increase the DB size considerably, but I think we could keep it down by only including these fields:

Code:
UserId Integer PRIMARY KEY,
UserName TEXT,
Trust TEXT,
Merit Integer,
Posts Integer,
Activity Integer,
Rank TEXT,
DateRegistered Integer,
LastActive Integer

I think these are the most relevant fields for any user and would help us create some nice stats.

Maybe I could help out with this process; just give me a heads-up if it can be done with your approach.
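The slimming idea above can be sketched with a CREATE TABLE ... AS SELECT over the wanted columns. The snippet below demonstrates it on an in-memory database with one made-up row; the same SQL statement would work on btctalk_full.db (with the slim table written into an ATTACHed output file).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for btctalk_full.db
# Reduced stand-in schema, just enough for the demo.
conn.execute("""CREATE TABLE UserData(
    UserId INTEGER PRIMARY KEY, UserName TEXT, Trust TEXT, Merit INTEGER,
    Posts INTEGER, Activity INTEGER, Rank TEXT,
    DateRegistered INTEGER, LastActive INTEGER, Email TEXT)""")
conn.execute("INSERT INTO UserData VALUES "
             "(1, 'Piggy', '0', 1416, 2000, 784, 'Hero Member', 0, 0, 'x')")

# Copy only the relevant columns into a slimmer table.
conn.execute("""CREATE TABLE UserDataSlim AS
    SELECT UserId, UserName, Trust, Merit, Posts, Activity,
           Rank, DateRegistered, LastActive
    FROM UserData""")

cols = [row[1] for row in conn.execute("PRAGMA table_info(UserDataSlim)")]
print(cols)  # the nine kept columns; Email and the rest are dropped
```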
Piggy (OP)
Hero Member | Activity: 784 | Merit: 1416
October 09, 2018, 12:31:32 PM
#5

As far as I'm aware, there is no way to get the data other than scraping the content from the web pages.

Space is not an issue; the only problem is that it will take several days to complete from one machine. If we take a round number of 2M users and hit the forum at 1 user/s, it should take ~556 hours, roughly 23 days Huh, unless I'm missing something.
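The estimate holds up as back-of-the-envelope arithmetic:

```python
# 2M profiles fetched at one request per second, from a single machine.
users = 2_000_000
hours = users / 3600   # seconds -> hours: ~555.6 hours
days = hours / 24      # ~23.1 days
print(round(hours, 1), round(days, 1))  # 555.6 23.1
```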
mainconcept
Sr. Member | Activity: 588 | Merit: 422
October 10, 2018, 07:55:58 AM
#6

Thanks for sharing your database!

I thought about scraping BCT myself. Currently, I'm learning a bit about Python and web scraping.

Quote from: Piggy on October 09, 2018, 12:31:32 PM
~~~
Space is not an issue; the only problem is that it will take several days to complete from one machine. If we take a round number of 2M users and hit the forum at 1 user/s, it should take ~556 hours, roughly 23 days Huh, unless I'm missing something.

Maybe you could group the user profiles into inactive, active and archived profiles, like Vod did with bpip.org.

That should reduce the scraping time a lot.
vlad230
Sr. Member | Activity: 616 | Merit: 279
October 10, 2018, 08:09:58 AM
#7

Quote from: Piggy on October 09, 2018, 12:31:32 PM
As far as I'm aware, there is no way to get the data other than scraping the content from the web pages.

Space is not an issue; the only problem is that it will take several days to complete from one machine. If we take a round number of 2M users and hit the forum at 1 user/s, it should take ~556 hours, roughly 23 days Huh, unless I'm missing something.
Yeah, it's going to take some time to get all that. I'll work on something more efficient if I have the time and get the data for all the users.

Quote from: mainconcept on October 10, 2018, 07:55:58 AM
[...]
Maybe you could group the user profiles into inactive, active and archived profiles, like Vod did with bpip.org.

That should reduce the scraping time a lot.
You would still need to get the data first to know whether a user is active or inactive, so it's not as straightforward as it seems.
Where does Vod get active/inactive data about users?
Piggy (OP)
Hero Member | Activity: 784 | Merit: 1416
October 10, 2018, 08:30:04 AM (last edit: October 10, 2018, 09:10:23 AM by Piggy)
#8

If I remember correctly, I may have read somewhere that the archived accounts are on the order of a few hundred thousand, but I would still need an offline list of the user IDs to exclude them directly. Undecided

I think Vod is probably looking at the last time a user logged in to determine whether he/she is active or inactive.

I was thinking: if there is a list somewhere of public proxies that actually work, I could cut down the time quite significantly. I imagine the server can handle many requests per second, but I would not want to make a mess Smiley
Vod
Legendary | Activity: 3696 | Merit: 3070
October 10, 2018, 05:43:53 PM
#9

Quote from: vlad230 on October 10, 2018, 08:09:58 AM
Where does Vod get active/inactive data about users?

By their Last Active date. 90 days without logging in and I mark them as inactive, unless they have no posts and no trust, in which case I move them to Archive.
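Vod's rule can be sketched as a small classifier. The function name and signature here are purely illustrative, not bpip.org's actual code; the assumed rule is: archived when a stale account has no posts and no trust, inactive after 90 days without logging in, otherwise active.

```python
def classify(days_since_active, posts, trust):
    """Illustrative sketch of the active/inactive/archived split described above."""
    if days_since_active <= 90:
        return "active"
    if posts == 0 and trust == 0:
        return "archived"   # stale account with no posts and no trust
    return "inactive"       # stale but has some posts or trust

print(classify(10, 0, 0))    # active
print(classify(120, 5, 1))   # inactive
print(classify(120, 0, 0))   # archived
```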
