I wouldn't have come across this if it hadn't been for the long saga of Quickseller abuse I've been suffering. Recently he started using a new alt/sockpuppet to attack me, and I began looking more closely at the situation. It seems that Panthers52 has done several deals which were escrowed by Quickseller. If Quickseller is escrowing for himself, that looks like scammy behavior to me. I'm not a trader here, so it may be that there's nothing wrong with this. In any case, I'll go ahead and present some quantitative evidence here and you guys can discuss it as you please.
I happen to have some training in statistical methods for natural language processing, so I know a thing or two about how people use language and how to measure it quantitatively. Although QS does a few funny things to try to disguise his use of Panthers52 as an alt (he doesn't use a sig-ad, he signs each message with "Kind Regards", etc.), these techniques are not very robust: they don't disguise QS's style of writing at all when looked at from a big-picture perspective, and that big-picture view is exactly what language modeling gives us.
One reason I set out to do this experiment is that all of the pieces are there. QS has written a pretty large corpus of posts under his main account, and there's a secondary account as well (one of his alts, which was outed only a few months ago) to do model checking on. So, here's the big-picture setup. We're going to download the corpora of posts of Quickseller, ACCTSeller (his outed alt), Panthers52 (his accused alt), hilariousandco, dooglus, and me. We'll then build language models using maximum likelihood parameter estimation for all of the ngrams in each corpus up to n=3. For those who don't know, 1-grams (unigrams) are all of the single-word tokens in the corpus, 2-grams (bigrams) are all of the word pairs, 3-grams (trigrams) are all of the word triples, and so on; the sentence "i sent the coins", for example, contains the bigrams "i sent", "sent the", and "the coins". The reason I don't use 4-grams or any higher n is that the data just gets sparser the higher you go, unless you have an incredibly large amount of data. For this project, a 3-gram model seemed appropriate (and the 3-gram section wasn't terribly sparse). So, step one: I downloaded all of the posts of these members as raw HTML. I used this script:
#!/bin/bash
# Download a member's entire post history as raw HTML.
u=$1
outdir=$2
# grab the first page of the member's "recent posts"
curl --data "action=profile&u=${u}&sa=showPosts" https://bitcointalk.org/index.php > "$outdir/page0.html"
# scrape the number of the last page out of the pagination links
dend=`sed -n -e 's/.*>\([0-9]\+\)<\/a> <span class="prevnext.*/\1/p' "$outdir/page0.html"`
end=`echo "$dend" | head -n 1`
echo "$end"
# fetch the remaining pages, 20 posts per page
i=1
while [[ $i -le $end ]]; do
  start=$((i*20))
  curl --data "action=profile&u=${u}&sa=showPosts&start=$start" https://bitcointalk.org/index.php > "$outdir/page${start}.html"
  i=$((i+1))
done
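If you save that as getposts.sh (the name is mine, use whatever you like), you'd invoke it once per account, something like:

./getposts.sh <uid> rawhtml/quickseller

where <uid> is the member's numeric bitcointalk user id.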
What's going on here is that you pass in a UID and an output directory. The script uses curl to get the first page of that member's "recent posts", uses sed to scrape the number of the last page of the post history, and then loops, fetching each page with curl and saving the raw HTML into the output directory. After doing this, I had a directory called rawhtml/ with a subdirectory for each of the accounts in my experiment.
The next step was to strip out all of the irrelevant HTML. Thankfully, the HTML has a class "post" which contains people's posts, and other classes for quotes and quoteheaders, so it's pretty easy to load a page into the BeautifulSoup HTML parser and strip out the quotes and quoteheaders. Here's my short-n-sweet python script to leave you with what I call "rawposts":
#!/usr/bin/env python
import sys
import os
from bs4 import BeautifulSoup

indir = sys.argv[1]
outdir = sys.argv[2]
for infile in os.listdir(indir):
    soup = BeautifulSoup(open(indir + "/" + infile), 'html.parser')
    # pull out quoted material so we keep only the member's own words
    for qh in soup.find_all("div", "quoteheader"):
        qh.extract()
    for q in soup.find_all("div", "quote"):
        q.extract()
    # everything inside <div class="post"> is a post body
    posts = soup.find_all("div", "post")
    with open(outdir + "/" + infile, "w") as f:
        for p in posts:
            f.write(str(p) + "\n")
    print("done writing " + infile)
So, I ran this script to create a subdirectory for each account in the experiment, and I ended up with a collection of posts, still as HTML, but without the embedded quotes. The next step was to tokenize the files and do some final cleanup before building the models. By tokenize, I mean explicitly dealing with punctuation and other funny stuff. If you leave periods and question marks stuck to the sides of words, you get some really funny counts and miss generalizations. A period is a really common token at the end of a sentence, so you want your model to have a high count for "." as a unigram and for ". </s>" as a bigram. But if you leave the periods stuck to words, you'll end up with lots of singletons: "something.", "do.", "find.", etc. I also realized that the smiley HTML tags would be better replaced by single tokens, so that we could see how they play into sentences. Finally, I wanted to replace links, which still showed up as <a href="..." target="_blank">link text</a>, with only their href value; the rest is just constant markup and gets in the way of measuring which URLs are actually being referenced. This latter point could be important in identifying authorship.
So, I made a sed file and tokenized the corpus. Here's my sed file:
# remove <div class="post"> and </div> wrappers
s/\(<div class="post">\)\|\(<\/div>\)//g
# change each smiley <img> tag into a single token like --grin.gif--
s/<img alt="[A-Za-z]\+" border="0" src="https:\/\/bitcointalk.org\/Smileys\/default\/\([A-Za-z.]\+\)"\/>/--\1--/g
# change <br> and <hr> into real line breaks
s/<br\/>/\n/g
s/<hr\/>/\n/g
# do sentence breaking after . ! ? when followed by space and a capital
s/\([?!\.]\)\s\+\([A-Z]\)/\1\n\2/g
# clean up links: keep just the href value as if it were text
s:</a>\|<a href=\|target="_blank">::g
# split punctuation off of words
s/\([,\.?]\)\($\|\s\)/ \1 \2/g
s/'s/ 's/g
s/\([()]\)/ \1 /g
# clean up any spurious whitespace at the ends of lines
s/\s\+$/\n/g
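If the rules above live in a file called tokenize.sed (my name for it; the corpora/ path below is likewise just how I'd lay it out), each account's rawposts can be pushed through it like this:

sed -f tokenize.sed rawposts/quickseller/* > corpora/qs.txt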
I also piped the output of this through "sed -e '/^$/d'" to remove any blank lines. After doing this, I had what I thought was a pretty usable, tokenized, one-"sentence"-per-line corpus for each of the accounts in my experiment. Hand inspection showed that there was still some noise in there, but crucially, all of the corpora were run through the same preprocessing and tokenization scripts, so any noise wouldn't be biased toward one account.
So, the next step was to do ngram counts over each of these corpora. To do this, you simply count all of the 1-, 2-, and 3-grams in the corpus and write out a counts file that you can use to build the language models. Note, I'm quite happy to share these count files with anyone who wants to see them. The thing is, I guess they're a little too large for most pastebin services; the quickseller counts file is approximately 8MB, for example. I can tar these up and email them to anyone who's interested, or if anyone has a site they don't mind hosting them on, I could send them to that person. Just let me know.
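I didn't paste my counting script, but it's nothing exotic. Here's a minimal sketch of the idea in Python (a reconstruction, not the exact script I ran, and the file arguments are illustrative); the output format, one ngram and its count per line, is exactly what the LM-building script below consumes:

#!/usr/bin/env python
# Sketch of the counting step: reads one tokenized sentence per line
# and writes "w1 [w2 [w3]] count" lines, sorted by count.
import sys
from collections import Counter

counts = Counter()
for line in open(sys.argv[1]):
    # pad each sentence with boundary markers so <s> and </s> get counted
    words = ["<s>"] + line.split() + ["</s>"]
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i+n])] += 1

with open(sys.argv[2], "w") as out:
    for ngram, c in counts.most_common():
        out.write("%s %d\n" % (ngram, c))

And here's the resulting counts directory: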
tspacepilot@computer:~/lm/counts$ ls -lah
total 43M
drwxr-xr-x 2 tspacepilot tspacepilot 4.0K Sep 4 12:05 .
drwxr-xr-x 8 tspacepilot tspacepilot 16K Sep 4 11:55 ..
-rw-r--r-- 1 tspacepilot tspacepilot 1.3M Sep 3 10:40 as.count
-rw-r--r-- 1 tspacepilot tspacepilot 16M Sep 4 08:21 d.count
-rw-r--r-- 1 tspacepilot tspacepilot 12M Sep 4 08:20 h.count
-rw-r--r-- 1 tspacepilot tspacepilot 617K Sep 3 10:41 pan.count
-rw-r--r-- 1 tspacepilot tspacepilot 8.2M Sep 3 10:38 qs.count
-rw-r--r-- 1 tspacepilot tspacepilot 5.8M Sep 3 10:40 tsp.count
The next step is to generate language models from the count files. I used Good-Turing smoothing over an MLE parameter estimation to generate plain-text files containing the models, in the standard NIST format. Here's the top of the file for tsp:
tspacepilot@computer:~/lm/lms$ head tsp.lm
\data\
ngram 1: type=21218 token=294893
ngram 2: type=117148 token=287741
ngram 3: type=215034 token=280589
\1-grams:
9787 0.0331883089798673 -1.4790148753233 ,
9243 0.0313435720752951 -1.50385151060555 the
8592 0.0291359916986839 -1.53557019528667 to
7152 0.0242528645983458 -1.61523695785429 </s>
7152 0.0242528645983458 -1.61523695785429 <s>
What you're seeing there are the counts for each ngram type. The tspacepilot model has 294893 tokens/word instances, which fall into 21218 types. To be clear, for those who don't have a background in this: if I say "the" twice, that's two tokens and one type. Then you see the start of the 1-grams section. You can see that I used a comma "," 9787 times, and that the comma represents 0.0331... of the probability mass of the unigram model (9787/294893 ≈ 0.0332); the third column is that mass converted to a log (base 10) value. Here I reused a perl script that I had made some time ago. It's short enough to show in its entirety here:
#!/usr/bin/perl
# Build ngram LM for given count file
# tspacepilot
use strict;

# setting up the input file handles
$#ARGV != 1 and die "Usage: $0 <ngram_count_file> <lm_file>\n";
my $ngram_count_file = $ARGV[0];
my $lm_file_name = $ARGV[1];
open(DATA, "<", $ngram_count_file) || die "cannot open $ngram_count_file.\n";
open(OUT, ">", $lm_file_name) || die "cannot open $lm_file_name for writing.\n";
my @data = <DATA>;
my %unis;
my $uni_toks;
my %bis;
my %flat_bis;
my $bi_toks;
my %tris;
my %flat_tris;
my $tri_toks;
# here we build up the hash tables that we'll use to print the answer
foreach my $line (@data){
    my @tokens = split(/\s+/, $line);
    my $l = $#tokens;
    if($l<1){
        print "error on this line of count file:\n$line\n";
        print "l = $l";
    } elsif($l==1){
        # a unigram line: "word count"
        $unis{$tokens[0]} = $tokens[1];
        $uni_toks += $tokens[1];
    } elsif($l==2){
        # a bigram line: "w1 w2 count"
        $bis{$tokens[0]}{$tokens[1]} = $tokens[2];
        $flat_bis{"$tokens[0] $tokens[1]"} = $tokens[2];
        $bi_toks += $tokens[2];
    } elsif($l==3){
        # a trigram line: "w1 w2 w3 count"
        $tris{"$tokens[0] $tokens[1]"}{$tokens[2]} = $tokens[3];
        $flat_tris{"$tokens[0] $tokens[1] $tokens[2]"} = $tokens[3];
        $tri_toks += $tokens[3];
    } else {
        print "error on this line of count file:\n$line\n";
        print "l = $l";
    }
}
print OUT "\\data\\\n";
print OUT "ngram 1: type=", scalar keys %unis, " token=$uni_toks\n";
print OUT "ngram 2: type=", scalar keys %flat_bis, " token=$bi_toks\n";
print OUT "ngram 3: type=", scalar keys %flat_tris, " token=$tri_toks\n";
# unigrams: P(w) = count(w) / total tokens
print OUT "\\1-grams:\n";
foreach my $uni (sort {$unis{$b} <=> $unis{$a} or $a cmp $b} (keys %unis)){
    my $prob = $unis{$uni}/$uni_toks;
    my $lgprob = log10($prob);
    print OUT "$unis{$uni} $prob $lgprob $uni\n";
}
# bigrams: P(w2|w1) = count(w1 w2) / count(w1 *)
print OUT "\\2-grams:\n";
my @two_gram_output;
foreach my $flat_bi (keys %flat_bis){
    my ($firstword) = $flat_bi =~ m/(\S+)/;
    my $denominator;
    foreach my $secondword (keys %{$bis{$firstword}}){
        $denominator += $bis{$firstword}{$secondword};
    }
    my $prob = $flat_bis{$flat_bi}/$denominator;
    my $lgprob = log10($prob);
    push(@two_gram_output, "$flat_bis{$flat_bi} $prob $lgprob $flat_bi\n");
}
my @sorted_two_grams = sort {(split /\s+/,$b)[0] <=> (split /\s+/,$a)[0]} @two_gram_output;
# print output for two grams
foreach (@sorted_two_grams){
    print OUT;
}
# trigrams: P(w3|w1 w2) = count(w1 w2 w3) / count(w1 w2 *)
print OUT "\\3-grams:\n";
my @three_gram_output;
foreach my $flat_tri (keys %flat_tris){
    my ($first_two_words) = $flat_tri =~ m/(\S+\s+\S+)/;
    my $denominator;
    foreach my $thirdword (keys %{$tris{$first_two_words}}){
        $denominator += $tris{$first_two_words}{$thirdword};
    }
    my $prob = $flat_tris{$flat_tri}/$denominator;
    my $lgprob = log10($prob);
    push(@three_gram_output, "$flat_tris{$flat_tri} $prob $lgprob $flat_tri\n");
}
my @sorted_three_grams = sort {(split /\s+/,$b)[0] <=> (split /\s+/,$a)[0]} @three_gram_output;
# print output for 3grams
foreach (@sorted_three_grams){
    print OUT;
}

sub log10 {
    my $n = shift;
    return log($n)/log(10);
}
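I ran that once per count file; assuming it's saved as build_lm.pl (the name is mine), that looks like:

perl build_lm.pl counts/qs.count lms/qs.lm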
Okay, with the language models all built (again, email me or PM me if you want to see the models themselves, I don't mind sharing them) we can start to get to the fun stuff. The goal of the experiment is to use the language models as predictors of the other accounts' texts. The typical measure for this is called "perplexity" (https://en.wikipedia.org/wiki/Perplexity). One nitty-gritty detail is what weights to give the 1-, 2-, and 3-gram portions of the model when calculating perplexity. Intuitively, putting more weight on the 1-grams puts more value on shared single words, i.e., the person's basic vocabulary. Putting more weight on the 3-grams puts more value on how that person puts words together, i.e., which three-word phrases they tend to use. I ended up using weights of 0.3, 0.4, 0.3 (uni-, bi-, trigrams) in calculating perplexity. For each language model, I calculated the perplexity it assigns to each of the corpora of the accounts in the experiment.
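Just to make the weighting concrete, here's a toy calculation (in Python, with made-up probabilities; nothing here is from the real models) showing how one token's interpolated probability turns into a perplexity:

import math

# made-up probabilities for one token given its history
p_uni, p_bi, p_tri = 0.01, 0.05, 0.20
l1, l2, l3 = 0.3, 0.4, 0.3            # unigram/bigram/trigram weights

# interpolated log10 probability of this one token
logp = math.log10(l1 * p_uni + l2 * p_bi + l3 * p_tri)

# perplexity over N predicted tokens is 10 ** (-(total logprob) / N);
# with just this single token, N = 1:
print(10 ** (-logp))                   # approximately 12.0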
Here comes the fun stuff, then: the results. First, checking the QS language model against every other corpus, as plain text:
==> qstest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=1393
logprob=-119405.183085554 ave_logprob=-2.02254828472914 ppl=105.329078517105
==> qstest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=108735
logprob=-1963318.24588274 ave_logprob=-2.55783608776103 ppl=361.273484388214
==> qstest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=53676
logprob=-1514039.01569095 ave_logprob=-2.42022420176373 ppl=263.162620156841
==> qstest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1093
logprob=-53775.973489288 ave_logprob=-2.07397020669089 ppl=118.568740528906
==> qstest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=29664
logprob=-666393.992923604 ave_logprob=-2.5821718218487 ppl=382.09541103913
Well, as you can see, QS's model predicts my corpus with a perplexity of 382, hilariousandco's with 263, and dooglus's with 361. But crucially, it predicts the posts of ACCTSeller and Panthers52 at 105 and 118! (Remember, lower perplexity means the model finds the text more predictable.)
What this means is that QS's posting style, when measured quantitatively, shows right through his attempts to hide what he was doing. This isn't too surprising for anyone who knows how language works, but it may be to others. For fun, I also ran each model as a predictor against each of the other corpora.
hilariousandco against all:
==> htest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2260
logprob=-136595.372784586 ave_logprob=-2.34820994988114 ppl=222.951269646594
==> htest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=109662
logprob=-1934327.44440288 ave_logprob=-2.52311368446967 ppl=333.513704608138
==> htest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1828
logprob=-60634.1796607556 ave_logprob=-2.40669126223528 ppl=255.088724501193
==> htest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=25750
logprob=-1193959.69530073 ave_logprob=-2.37727869117974 ppl=238.384871857193
==> htest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=26006
logprob=-662995.55023098 ave_logprob=-2.5330988076818 ppl=341.270546308425
So, we can see that hilariousandco's model doesn't predict any one of the rest of us notably better than another, at least not significantly. However, it is interesting that his model assigns perplexities in the same range to all three of quickseller's accounts. This provides an oblique suggestion as to the similarity of those corpora. Here is dooglus' model predicting each of the other accounts:
==> dtest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2518
logprob=-141009.183781008 ave_logprob=-2.43488713532615 ppl=272.199382299313
==> dtest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=44764
logprob=-1532563.94318701 ave_logprob=-2.4154264735252 ppl=260.271415205445
==> dtest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=1752
logprob=-61358.7835651667 ave_logprob=-2.42812756490569 ppl=267.995538997277
==> dtest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=26384
logprob=-1223316.26268869 ave_logprob=-2.43880882666145 ppl=274.668481585288
==> dtest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=20198
logprob=-680500.394458114 ave_logprob=-2.5435368577456 ppl=349.572175864552
Here's my model predicting all the other corpora:
==> ttest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=2850
logprob=-139530.390079984 ave_logprob=-2.42324400972532 ppl=264.998862488461
==> ttest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=99717
logprob=-1946265.50900313 ave_logprob=-2.50617510057216 ppl=320.756230152803
==> ttest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=50287
logprob=-1518909.27782387 ave_logprob=-2.41492682099994 ppl=259.972147091511
==> ttest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=2043
logprob=-61310.1514410114 ave_logprob=-2.45446781060136 ppl=284.752673700336
==> ttest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=30864
logprob=-1209678.28851218 ave_logprob=-2.43335322477326 ppl=271.239680896164
Finally, we can also use the acctseller and panthers52 models to predict the other corpora. These models are built from less data than the qs model, so I think the results are not as impressive, but they demonstrate the same pattern.
==> atest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=158655
logprob=-1864342.35403158 ave_logprob=-2.59784345298067 ppl=396.135216494324
==> atest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=87812
logprob=-1444217.53179264 ave_logprob=-2.44185825794015 ppl=276.603873729012
==> atest-panthers52-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=1663 num=25359 oov_num=2433
logprob=-54938.2415881704 ave_logprob=-2.23426091293548 ppl=171.498731827101
==> atest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=36302
logprob=-1072293.35965131 ave_logprob=-2.18084989129508 ppl=151.652610771117
==> atest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=47163
logprob=-623320.832692272 ave_logprob=-2.59095185177354 ppl=389.898758003026
Again, dooglus, hilariousandco, and me are all above 270, whereas the other known quickseller account is at 151 and the "suspected" alt is at 171. And with the panthers model:
tspacepilot@computer:~/quickseller/ppls/ptest$ tail -n 3 *
==> ptest-acctseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=2722 num=57708 oov_num=5835
logprob=-126943.515020739 ave_logprob=-2.32518573167395 ppl=211.439309416701
==> ptest-dooglus-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=48667 num=827638 oov_num=200298
logprob=-1733046.66220228 ave_logprob=-2.56365194769031 ppl=366.144021870075
==> ptest-hilariousandco-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=42799 num=636455 oov_num=110187
logprob=-1420281.45120892 ave_logprob=-2.49580708635173 ppl=313.18942275869
==> ptest-quickseller-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=24371 num=503617 oov_num=55974
logprob=-1089757.40317691 ave_logprob=-2.30873957801444 ppl=203.582094424962
==> ptest-tspacepilot-3.4.3.ppl <==
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
sent_num=7150 num=280589 oov_num=56725
logprob=-602993.466557261 ave_logprob=-2.61020313295844 ppl=407.570866725746
Again, the panthers model is actually the smallest in terms of input data, so you can see how it's a little less robust for that reason. Nevertheless, the similarities with the acctseller and quickseller corpora really stand out when compared against the values assigned to the dooglus, hilariousandco, and tspacepilot corpora. Let's summarize all of this in a table (each row is a model, each column is the corpus it's predicting):
model     |  qs   | accts | pan52 | doog  | hilarious |  tsp
qs        |   X   | 105.3 | 118.5 | 361.2 |   263.1   | 382.1
accts     | 151.6 |   X   | 171.4 | 396.1 |   276.6   | 389.9
pan52     | 203.5 | 211.4 |   X   | 366.1 |   313.1   | 407.6
doog      | 274.6 | 272.1 | 267.9 |   X   |   260.3   | 349.5
hilarious | 238.3 | 222.9 | 255.1 | 333.5 |     X     | 341.2
tsp       | 271.2 | 264.9 | 284.7 | 320.7 |   259.9   |   X
So, one thing I want to be clear on: perplexity measures how well a model predicts a given corpus, and lower means a better prediction. The first row shows us that the QS model predicts the acctseller and panthers52 corpora approximately equally well, and far better than it predicts any of the others. Most of the other rows are just there to give you perspective. You can see that the dooglus, hilariousandco, and tsp models don't single out anyone's corpus the way the QS-family models do: the lowest value in those three rows is 222.9, while the three QS-family accounts predict each other in the 105-211 range.
For completeness, here's the script I used to calculate perplexity:
#!/usr/bin/perl
# Calculate the perplexity that an ngram LM assigns to test data
use strict;

# setting up the input file handles
$#ARGV != 5 and die "Usage: $0 <lm_file> <l1> <l2> <l3> <test_data> <output>\n";
my $lm_file = $ARGV[0];
my ($l1, $l2, $l3) = ($ARGV[1], $ARGV[2], $ARGV[3]);
my $test_data = $ARGV[4];
my $output = $ARGV[5];
open(LM, "<", $lm_file) || die "cannot open $lm_file.\n";
my @data;
if ($test_data eq "-") {
    @data = <STDIN>;
} else {
    open(DATA, "<", $test_data) || die "cannot open $test_data.\n";
    @data = <DATA>;
}
open(OUT, ">", $output) || die "cannot open $output for writing.\n";
my $lmstring;
while (<LM>){
    $lmstring .= $_;
}
# build up the lm data structures for quicker retrieval;
# each LM line looks like "count prob logprob ngram", and we keep the probs
my @lm = split(/\\data\\|\\1-grams:|\\2-grams:|\\3-grams:/, $lmstring);
shift @lm;
my @data_lines = split(/\n/, $lm[0]);
my @one_gram_lines = split(/\n/, $lm[1]);
my @two_gram_lines = split(/\n/, $lm[2]);
my @three_gram_lines = split(/\n/, $lm[3]);
my %unis;
foreach (@one_gram_lines){
    my ($prob, $w) = $_ =~ /\S+\s+(\S+)\s+\S+\s+(\S+)/;
    $unis{$w} = $prob;
}
my %bis;
foreach (@two_gram_lines){
    my ($prob, $w1, $w2) = $_ =~ /\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)/;
    $bis{"$w1 $w2"} = $prob;
}
my %tris;
foreach (@three_gram_lines){
    my ($prob, $w1, $w2, $w3) = $_ =~ /\S+\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)/;
    $tris{"$w1 $w2 $w3"} = $prob;
}
my $sum;
my $cnt;
my $word_num;
my $oov_num;
my $sent_num;
for my $s (0 .. $#data){
    if ($data[$s] =~ m/^\s*$/){
        next;
    }
    $sent_num++;
    chomp $data[$s];
    # wrap the sentence in boundary markers
    $data[$s] = "<s> ".$data[$s]." </s>";
    my @words = split /\s+/, $data[$s];
    print OUT "\n\nSent #".($s+1).": @words\n";
    my $sprob = 0;
    my $soov = 0;
    for my $i (1 .. $#words){
        $word_num++;
        if ($i == 1){
            # w1 given <s>: only bigram and unigram probs are available
            my ($w1, $w2) = ($words[$i-1], $words[$i]);
            my $onegramprob;
            my $twogramprob;
            my $unknown_word;
            my $smoothed_prob;
            if (defined($unis{$w2})){
                $onegramprob = $unis{$w2};
            } else {
                $unknown_word = 1;
            }
            if (!$unknown_word){
                if (defined($bis{"$w1 $w2"})){
                    $twogramprob = $bis{"$w1 $w2"};
                } else {
                    $unknown_word = 1;
                }
            }
            if ($unknown_word){
                $smoothed_prob = "-inf (unknown word)";
                $soov++;
            } else {
                # fold the trigram weight into the bigram term here
                $smoothed_prob = log10((($l3+$l2) * $twogramprob) + ($l1 * $onegramprob));
                $sprob += $smoothed_prob;
            }
            print OUT ($i);
            print OUT ": LogP( $w2 | $w1 ) = $smoothed_prob\n";
        } else {
            my ($w1, $w2, $w3) = ($words[$i-2], $words[$i-1], $words[$i]);
            my $threegramprob;
            my $twogramprob;
            my $onegramprob;
            my $unknown_word;
            my $unknown_ngram;
            my $smoothed_prob;
            # look up the trigram and its backoffs
            if (defined($unis{$w3})){
                $onegramprob = $unis{$w3};
            } else {
                $unknown_word = 1;
            }
            if (defined($bis{"$w2 $w3"})){
                $twogramprob = $bis{"$w2 $w3"};
            } else {
                $unknown_ngram = 1;
            }
            if (defined($tris{"$w1 $w2 $w3"})){
                $threegramprob = $tris{"$w1 $w2 $w3"};
            } else {
                $unknown_ngram = 1;
            }
            print OUT ($i);
            if ($unknown_word){
                print OUT ": LogP( $w3 | $w1 $w2 ) = -inf (unknown word)\n";
                $soov++;
            } elsif ($unknown_ngram){
                $smoothed_prob = log10(($l3*$threegramprob)+($l2*$twogramprob)+($l1*$onegramprob));
                print OUT ": LogP( $w3 | $w1 $w2 ) = $smoothed_prob (unknown ngrams)\n";
            } else {
                $smoothed_prob = log10(($l3*$threegramprob)+($l2*$twogramprob)+($l1*$onegramprob));
                print OUT ": LogP( $w3 | $w1 $w2 ) = $smoothed_prob\n";
            }
            $sprob += $smoothed_prob if defined $smoothed_prob;
        }
    }
    my $sppl = 10**(-($sprob/($#words-1)));
    print OUT "1 sentence, ".($#words-1)." words, $soov OOVs\n";
    print OUT "logprob=$sprob, ppl=$sppl";
    $sum += $sprob;
    $oov_num += $soov;
    $cnt += $#words - 1;
}
my $ave_logprob = $sum/($sent_num + $cnt - $oov_num);
my $ppl = 10**(-$ave_logprob);
print OUT "\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n";
print OUT "sent_num=$sent_num num=$cnt oov_num=$oov_num\n";
print OUT "logprob=$sum ave_logprob=$ave_logprob ppl=$ppl\n";

sub log10 {
    my $n = shift;
    return log($n)/log(10);
}
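Assuming that script is saved as calc_ppl.pl (my name for it; the corpus path is likewise illustrative), the QS-model-versus-panthers52 run from above would be invoked as:

perl calc_ppl.pl lms/qs.lm 0.3 0.4 0.3 corpora/panthers52.txt qstest-panthers52-3.4.3.ppl

and the "tail -n 3" of each resulting .ppl file is what I've been quoting in the results above.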
In sum, we know that Quickseller is adept at checking the blockchain to reveal transactions signed by particular accounts and to link them together. So it makes sense that he knows how to cover his tracks there, using mixers and whatnot to make it difficult to detect his alts that way. He's an expert in this, so while I haven't tried, I suspect it would be difficult to link any of his accounts on the blockchain. However, he's presumably not an expert in forensic linguistics and statistical NLP, so he didn't realize that publishing a corpus of 552365 word tokens gives anyone who wants to detect his alts a reasonably reliable way to find the statistical fingerprint that is right there in how he writes.
There's plenty of other circumstantial evidence that Panthers52 is an alt of Quickseller, but I'll leave that for others to talk about and discuss. Also, I'm not a trader here, so I'm not really affected by QS escrowing his own deals, but perhaps others who are will have more to say about whether this practice is truly a scam. I opened this thread because it seemed like scammy behavior to me, and I wanted others to be aware of it.
Here is a screenshot of QS's feedback, taken today:
Again, if anyone has any questions about this experiment or wants access to the particular data I ended up using, just let me know. I believe I've provided all the tools in this post for you to replicate these results yourself, but if something's missing, let me know about it.