Guess I am back and need to find a productive way to support our POS (proof of shill) coins with fancy signatures, so I am restarting the project I was working on before some jerk offered me a job and distracted me: the bitcointalk visualization project, or whatever. Basically, using ML and visualization for things like shitposter modeling, shilling detection, strange sentiment analysis, etc.
Anyhoo, who knows, I'd rather do this than shitpost. And it is good for me. So let's talk about how to visualize shilling within things like forum threads (or product review sets, which many of them are).
I took a stab at it based on the idea of visualizing distributions of sentiment over time, so we can see when incongruities of sentiment form (which should look like two opposing clusters), and applied it to a toy situation involving a toy set of product reviews.
Step 1: Build a classifier to provide positive and negative sentiment scores. I used Python and Naive Bayes (though I am addicted to decision and regression tree ensembles now, but whatevs, this was a while back) and some sentiment-labeled reviews.
Step 2: Simulate a situation where a shill starts posting. Yeah, so I just came up with a bunch of reviews off the top of my head that were clearly positive or negative:
"wonderful wonderful wonderful",
"real neat",
"best movie of the year",
"the movie was great I loved it",
"best acting favorite",
"awful", <--- shills start
"the movie was amazing",
"terrible movie",
"favorite actor",
"worst movie bad acting",
"most amazing movie ever",
"sucks bad"
Step 3: Score each statement with your classifier, which, being Naive Bayes, gives both a positive and a negative sentiment score.
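For anyone who wants to poke at Steps 1 to 3, here is a minimal sketch of a word-count Naive Bayes scorer in plain Python. It is not my original code: the tiny TRAIN set is invented for this example (a real run needs a proper corpus of sentiment-labeled reviews), and the names train and score are just illustrative. It assumes whitespace tokenization, add-one smoothing, and that words never seen in training are ignored.

```python
import math
from collections import Counter

# Hypothetical mini training set, made up for this sketch only.
TRAIN = [
    ("great movie loved it", "pos"),
    ("wonderful acting best film", "pos"),
    ("amazing favorite of the year", "pos"),
    ("neat and fun", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful acting worst film", "neg"),
    ("bad plot it sucks", "neg"),
    ("boring and dull", "neg"),
]

# The toy thread from Step 2, in posting order.
POSTS = [
    "wonderful wonderful wonderful", "real neat", "best movie of the year",
    "the movie was great I loved it", "best acting favorite", "awful",
    "the movie was amazing", "terrible movie", "favorite actor",
    "worst movie bad acting", "most amazing movie ever", "sucks bad",
]

def train(examples):
    """Count words per label and documents per label."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

def score(model, text):
    """Return (positive, negative) posterior probabilities for one statement."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_post = {}
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        lp = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # skip words the model has never seen
                # add-one (Laplace) smoothed word likelihood
                lp += math.log((word_counts[label][word] + 1)
                               / (total_words + len(vocab)))
        log_post[label] = lp
    # Normalize in log space so the two scores sum to 1.
    top = max(log_post.values())
    unnorm = {k: math.exp(v - top) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return unnorm["pos"] / z, unnorm["neg"] / z

model = train(TRAIN)
scores = [score(model, post) for post in POSTS]
for post, (pos, neg) in zip(POSTS, scores):
    print(f"{post!r}: pos={pos:.2f} neg={neg:.2f}")
```

With a training set this small the numbers mean very little; the point is only that every post comes out with a pair of scores you can use as coordinates.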
Step 4: Plot these as coordinates: the x axis represents negative sentiment, the y axis positive sentiment, and the z axis time.
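A rough sketch of Step 4, assuming you already have one (positive, negative) score pair per post, in posting order. The helper names to_points and plot_sentiment_3d are made up for this example, the post index stands in for time (swap in real timestamps if you have them), and the example_scores values are placeholders that show the input shape, not output from any real classifier. The plotting part assumes matplotlib is installed.

```python
def to_points(scores):
    """Turn (pos, neg) pairs, in posting order, into (x, y, z) points.

    x = negative sentiment, y = positive sentiment, z = post index
    (used here as a stand-in for time).
    """
    return [(neg, pos, t) for t, (pos, neg) in enumerate(scores)]

def plot_sentiment_3d(points):
    """Scatter the points in 3D; requires matplotlib."""
    import matplotlib.pyplot as plt  # imported here so to_points works without it

    xs, ys, zs = zip(*points)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(xs, ys, zs)
    ax.set_xlabel("negative sentiment")
    ax.set_ylabel("positive sentiment")
    ax.set_zlabel("time (post order)")
    plt.show()

# Placeholder (pos, neg) pairs, invented purely to illustrate the input.
example_scores = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.85, 0.15)]
points = to_points(example_scores)
print(points)
# plot_sentiment_3d(points)  # uncomment to draw the scatter plot
```

Keeping the coordinate building separate from the drawing means you can dump the points to a file and try other visualizations without rerunning the classifier.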
And you end up with this!
Observation: It is basically simple enough to speak for itself, but in this toy situation, when the shills start posting, a second 'opposing' cluster of sentiment forms.
Problem: Of course, applying this to real-world messy data. These forum threads are not like my toy data set, and it might take some serious many-dimensional data visualization wizardry to bring the signal out of the noise. Second, building a classifier specifically on altcoin discussion data, as it is going to take some cleverness to label enough of it for supervised learning (we could try unsupervised fun stuff though). We are not a product review site and we have all our own lingo.
Next steps: This has potential to go somewhere, so I would build a new model using altcoin data on Twitter, set up my BCT crawler and get it running, because you have to adhere to crawling limits around here, then just start visualizing. Though I will probably switch to classification trees... they tell stories.