Yes, linking by unique ID is enough, and we don't need everything stored in one huge table. We just need to join the tables on the information they have in common.
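To make the linking idea concrete, here is a rough sketch of joining separate tables on a shared ID, using plain Python dicts. The table names, the fields and the `join_on_id` helper are all made up for illustration; nothing here reflects our actual schema.

```python
# Hypothetical per-coin tables, each keyed by the same unique coin ID.
prices = {"btc": {"close": 100.0}, "xyz": {"close": 0.02}}
social = {"btc": {"mentions": 5000}, "xyz": {"mentions": 120}}
volumes = {"btc": {"volume": 10}}  # "xyz" is missing here on purpose


def join_on_id(*tables):
    """Inner join: keep only IDs present in every table and merge their fields."""
    common_ids = set(tables[0])
    for table in tables[1:]:
        common_ids &= set(table)
    joined = {}
    for coin_id in sorted(common_ids):
        row = {}
        for table in tables:
            row.update(table[coin_id])
        joined[coin_id] = row
    return joined


two_way = join_on_id(prices, social)
three_way = join_on_id(prices, social, volumes)
```

Each table stays small and separate; the ID is the only thing they need to share. Coins missing from any one table simply drop out of the joined result.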
A small consideration on my side: sentiment analysis might not be the easiest place to start. Look at the false positives that sentiment analysis software picks up. I would also imagine this is a place where most people have very strong opinions, as on all social channels. Perhaps the first victories should be won through numbers alone.
As for the features to use, there could be some that are obvious to us humans, which we then need to translate into a formula so that a simple script can pick them up.
Let us talk about how we could approach this (and please post your thoughts below):
I think one of the easiest could be pump and dump schemes. Why do I think so? Is a pump and dump not defined as a sharp drop after a period of growth?
So, let's say we have open and close prices, and we calculate the percentage change between them. Now we generate our pump and dump feature: if the coin lost, let's say, 80% of its value over the span of one day, I consider that an orchestrated dump. I would create a new variable that records such an event with a simple 1. Then I would count the number of times this has happened over the whole lifetime of the coin. And because I don't know whether that count is normal or not, I would compare it with the other coins in my database. Finally, I would make a bar chart sorted by number of occurrences to find out which coins suffered from being dumped and whether there are serial pump and dump schemes on particular coins. Perhaps you will think 80% is a bit too much, and you are right. Further variables could be dump 70%, 60%, 50%... and voilà, you have a lot of new features.
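Roughly, the recipe above could look like the sketch below. The coin names and prices are invented toy data, the function names are only for illustration, and I am assuming each day is an (open, close) pair with a positive open price.

```python
def daily_change(open_price, close_price):
    """Fractional change from open to close; -0.8 means an 80% loss."""
    return (close_price - open_price) / open_price


def count_dumps(days, threshold):
    """Count days on which the coin lost at least `threshold` of its value."""
    return sum(1 for o, c in days if daily_change(o, c) <= -threshold)


def dump_features(history, thresholds=(0.8, 0.7, 0.6, 0.5)):
    """One dump count per coin and per threshold (dump 80%, dump 70%, ...)."""
    return {
        coin: {t: count_dumps(days, t) for t in thresholds}
        for coin, days in history.items()
    }


def rank_by_dumps(features, threshold):
    """Coins sorted by dump count at one threshold, most dumped first."""
    return sorted(features, key=lambda coin: features[coin][threshold], reverse=True)


# Toy data: coin -> list of (open, close) prices, one pair per day.
history = {
    "aaa": [(1.0, 1.5), (1.5, 0.2), (0.2, 0.3), (0.3, 0.05)],
    "bbb": [(10.0, 11.0), (11.0, 4.0), (4.0, 4.2)],
    "ccc": [(5.0, 5.1), (5.1, 4.9)],
}
features = dump_features(history)
ranking = rank_by_dumps(features, 0.6)
```

The ranking is what would feed the bar chart. Note that this only looks at the drop itself; it does not check that growth came first, so it flags any one-day crash, orchestrated or not.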
Please keep this discussion as interactive as possible and I will try to execute your ideas.
1. How do you distinguish pumps/dumps from actual surges and declines due to interest? The hand-coded rules you suggest are a little arbitrary.
There are places where both supervised and unsupervised learning could help. Too bad we don't have labeled data for training. Clustering might be fun.
When we did GPS stream analysis for seismic event detection, the movement time series were clustered into groups that defined basic patterns of movement; that was a fun one.
There are many methods for working with and comparing time series. Know any good distance measures? Time warping helps but might not be best, because we don't want to match a slow growth/decline pattern with a pump/dump pattern.
2. Identifying 'pump coins' using these features and additional machine learning would be awesome. If you have hand-coded rules or a machine-learning-generated model for pump/dump detection, then we can use the resulting time series of 'pump/dump', 'natural growth', etc. as features. Again, having coins labeled as 'pump' coins would be very, very useful. Supervised and unsupervised approaches could both be fruitful.
3. We have another strategy for identifying shill/scam/pumpdump coins! I will email you about this as we are still working it out. It might be useful for identifying shill coins as well as shillers.
4. I have experience doing sentiment analysis. We built our own system for detecting the emotional polarity of texts which could be useful. It is not planned for iteration 1 though so don't expect it.
5. One problem with sentiment analysis is handling a time window that contains both positive and negative sentiment expressions. We don't want to simply call that window neutral, as this happens all the time when shilling occurs as well as naturally.
6. Some legit coins are pumped and dumped. That doesn't make them a scam, and some eventually do well if you are a long-term investor.
7. Volume along with price is important. One thing that happens pre-pump is a number of small purchases by the pumpers, which may or may not become a useful feature.
8. A hype score would be a useful feature. Most pumpers shill along with the pump. This is my specialty right now, shilling detection.
9. I am far, far, far from an expert. I have a graduate degree but also many gaps. I welcome any knowledge and advice.
10. We will work on different approaches. We will set you up to do with the data as you please.
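On the distance measure question in point 1, here is a minimal sketch of dynamic time warping, assuming that is what "time warping" refers to. The optional `window` band, which limits how far the time axis may be stretched, is my own assumption about how the slow-versus-fast matching could be restricted. It is not something we have settled on, and the two series are toy data.

```python
import math


def dtw_distance(a, b, window=None):
    """Dynamic time warping distance with absolute difference as the local cost.

    window=None allows any alignment; window=0 on equal-length series
    reduces to a plain point-by-point sum of absolute differences.
    """
    n, m = len(a), len(b)
    # A band narrower than the length difference could never reach the end cell.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


slow_decline = [10, 9, 8, 7, 6, 5, 4, 3, 2]
sharp_dump = [10, 10, 10, 10, 10, 10, 10, 2, 2]

free = dtw_distance(slow_decline, sharp_dump)             # unconstrained warping
locked = dtw_distance(slow_decline, sharp_dump, window=0)  # no warping at all
banded = dtw_distance(slow_decline, sharp_dump, window=2)  # limited warping
```

On these toy series the unconstrained distance is 16 against 22 with no warping, so warping does pull the slow decline and the dump closer together, which is exactly the worry. A narrow band keeps some tolerance for small time shifts while limiting that effect.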