The following text is a translation of one of my comments:
https://bitcointalk.org/index.php?topic=4586924.msg41380372#msg41380372. The translation was done quickly but will improve with time.
Contents

Prerequisites
Introduction
Hidden Markov Models for sentiment analysis
Trading by means of a Hidden Markov Model
Price forecasting with a Markov Chain
Discussion
Outlook
References

Prerequisites

Installation of R

R is a free programming language which specializes in statistical computing (https://en.wikipedia.org/wiki/R_(programming_language)):

https://www.r-project.org/

This is the only software package needed for the following code snippets to work by copy and paste. Additionally, it makes sense to install RStudio:

https://www.rstudio.com/

RStudio is a graphical user interface and programming environment for R which simplifies programming in R as well as the management of data.
Installation of necessary R packages

The following code, executed in R, installs packages which are necessary for this text or might become useful later:
install.packages(c("zoo", "anytime", "data.table", "depmixS4"));
Packages are loaded in R as follows:
library(zoo);
library(anytime);
library(data.table);
library(depmixS4);
Introduction

Example with ideal and manipulated die

The following example is chosen such that it is analogous to chart analysis.
Suppose you take part in the Mensch Ärgere Dich Nicht World Championship (DNGAWC in the following, which stands for the English translation: Do Not Get Angry World Championship). For this game you need a die. If you throw a six, you may throw the die again. Furthermore, the player with the six has the choice to bring a piece into play or to throw again normally. Thus, throwing a six is an advantage. For an ideal die, the probability of a six is exactly 1/6, the same as for every other number. You could try to help your luck by using a manipulated die which looks as similar as possible to the official die of the world championship. Hence, it would be good if the organizers of the DNGAWC were able to detect whether an official die was used or not. Cheating players could then be disqualified from the championship.
At the beginning, we consider the ideal die and compare it with the manipulated die. The manipulated die should increase the probability of throwing a six. In this example, we assume that the probability of a six with the manipulated die is 2/7 and 1/7 for each of the other numbers. We now throw both dice very often and note down the sequence of thrown numbers:
library(zoo);
library(depmixS4);
# ideal die
x_ideal <- floor(runif(1000000)*6)+1; # throw very often
# manipulated die: map 7 to 6, so P(6) = 2/7 and P(k) = 1/7 for the other numbers
x_manip <- floor(runif(1000000)*7)+1; x_manip[x_manip==7] <- 6; # throw very often
To get a feeling for the distribution of the numbers, we combine 500 throws via a moving window over each of the two sequences (also try other window sizes). We note the mean value of each window and plot the resulting mean value distributions for both sequences:
y_ideal <- rollapply(x_ideal, width=500, FUN=mean); # combine 500 throws
h_ideal <- hist(y_ideal, 50, plot=FALSE); # create histogram
h_ideal$density <- h_ideal$counts/sum(h_ideal$counts); # normalize density
y_manip <- rollapply(x_manip, width=500, FUN=mean); # combine 500 throws
h_manip <- hist(y_manip, 50, plot=FALSE); # create histogram
h_manip$density <- h_manip$counts/sum(h_manip$counts); # normalize density
plot(h_ideal, col=rgb(0,0,1,1/4), xlim=c(3.3, 4.1), freq=FALSE, main="Mean of 500 throws with\nideal (left) and manipulated die (right)", xlab="mean", ylab="frequency"); # first histogram
plot(h_manip, col=rgb(1,0,0,1/4), xlim=c(3.3, 4.1), freq=FALSE, add=TRUE); # second histogram
z <- seq(3.3, 4.1, 0.001); # points on x-axis
lines(z, dnorm(z, mean(y_ideal), sqrt(var(y_ideal)))/100, col="blue"); # plot normal distribution with parameters from simulation
lines(z, dnorm(z, mean(y_manip), sqrt(var(y_manip)))/100, col="red"); # plot normal distribution with parameters from simulation
abline(v=mean(y_ideal), col="blue"); # mark mean
abline(v=mean(y_manip), col="red"); # mark mean
The result is shown in the plot below:
Obviously, the two distributions are Gaussian. If a sequence was created with the ideal die, then the mean of 500 throws should come from the left distribution. If a sequence was generated with the manipulated die, then the mean comes from the right distribution. If a sequence was created with both dice, then the mean should come from either one or the other distribution. Whether a sequence was created with one or two different dice, i.e., whether the dice were exchanged during the game, can now be detected with a Hidden Markov Model (HMM).
However, an important requirement for this example is that the number of consecutive throws with each die is large (e.g., 10000) compared to the window size (500). The HMM modelling one die consists of one node and one edge. In the case of the ideal die, the node corresponds to the blue normal distribution above. The node generates numbers such that the distribution of the mean of 500 die throws follows the blue normal distribution. Since there is only one edge, with probability P = 1.0, the new state of the HMM after each throw MUST be the old one:
If the ideal die is exchanged for a manipulated die during the game, the corresponding HMM consists of at least two nodes: a node that generates the distribution of the ideal die and a node that generates the distribution of the manipulated die:
The edges with the probabilities P(blue to blue), P(blue to red), P(red to red) and P(red to blue) depend on how often the dice were exchanged during the game.
Here, the sequence of numbers should be long enough that more than two exchanges (state changes) took place! The more, the better. An HMM whose parameters are known is also called a Markov chain. A Markov chain, interpreted as a model of a given (stochastic) process, mimics that process mathematically and thus represents something like a simulation of the process. In our example, this would be the simulation of the numbers thrown by a fair player or by a cheater during the DNGAWC.
Thus, an HMM is an automaton, i.e., a graph with several nodes (states) and edges (state changes). One of the nodes is special, since it determines the state the automaton is currently in. The edges are labeled with probabilities such that the outgoing edges of each node sum up to a probability of P = 1.0. The automaton then jumps in discrete steps from node to node along the edges, according to the probabilities indicated at the edges. Since the sum of the probabilities at the outgoing edges equals 1.0, the automaton changes its state in each step, though the new state can be the same as the old one. If the HMM is in state A and with the next jump the state changes to B (with A != B), then you have a (real) state change.
In the case of an HMM, the parameters of the model, i.e., the probabilities at the edges of the automaton, are not known at the beginning and must be deduced from given data. This is also called fitting the model to the data, and the result is simply called the fitted HMM. Having found the optimal parameters, one can compute the most probable path through the graph, i.e., over the nodes over time. The sequence of nodes that the automaton passes through is called the sequence of hidden (unknown) states. In our example, the data ONLY consist of the thrown die numbers. The hidden states correspond to the use of the dice, i.e., when an ideal die and when a manipulated die was used in the sequence of thrown numbers.
We now generate a simulated sequence of thrown die numbers, which satisfies our requirements (many thrown die numbers in a row by the same die and many dice exchanges), and try to detect the dice exchanges in the resulting number sequence by means of a fitted HMM:
set.seed(1); # set random seed for reproducibility
# generate sequence with ideal and manipulated die
a <- floor(runif(10000)*6)+1;
b <- floor(runif(50000)*7)+1; b[b==7] <- 6;
c <- floor(runif(20000)*6)+1;
d <- floor(runif(10000)*7)+1; d[d==7] <- 6;
e <- floor(runif(30000)*6)+1;
f <- floor(runif(20000)*7)+1; f[f==7] <- 6;
g <- floor(runif(40000)*6)+1;
sequence <- c(a,b,c,d,e,f,g);
mseq <- as.data.frame(t(t(rollapply(sequence, width=500, FUN=mean)))); # moving-window means as one-column data frame
colnames(mseq) <- c("MEAN");
The generation of the HMM, the fit of the model to the data, and the extraction of the relevant data is a three-liner in R:
# generate and fit Hidden Markov Model
hmm <- depmix(MEAN ~ 1, family = gaussian(), nstates = 2, data=mseq);
hmmfit <- fit(hmm, verbose = FALSE);
post_probs <- posterior(hmmfit);
For the generation of a meaningful plot, we need more code:
dev.new(width=19, height=3.5); # create plot
matplot(post_probs[,-1], type="l", col=c("blue", "red"), lty = c(1,1), xaxt="n"); # plot posterior probabilities
axis(1, at = 10000*(0:18), labels = 10000*(0:18)); # normalize x-axis
legend("left", legend = c("ideal", "gezinkt"), col = c("blue", "red"), lty = c(1,1), lwd = 1 , xpd = T ); # legend
title("Würfel-HMM mit zwei Zuständen"); # titel
# generate sequence of HMM states
for (i in 1:nrow(post_probs)){rect(i-1, 0, i, -0.1, col=c("blue", "red")[post_probs[i,1]], border=NA)}
# model parameters and transition probabilities
summary(hmmfit);
hmmfit@transition;
The result should look as follows:
On the x-axis, the sequence of states is shown as a colored horizontal band. The curves in the plot represent the (a posteriori) probability of the corresponding die at the given time point. Note that the HMM does not reproduce the states perfectly (also try other window sizes). This is due to statistical fluctuations, which means that the blue distribution may look like the red one by coincidence (and vice versa). Because of the moving window, there are, of course, fewer than 180,000 states (179501). The real transition probabilities can only be accessed using the command "hmmfit@transition". The command "summary(hmmfit)" is not sufficient, because it apparently rounds the probabilities! The corresponding HMM is then given by the following graph:
Discussion of resulting variables

Todo: Discussion of resulting variables
Comparison and selection of models

If the organizers suspect a player of cheating during the DNGAWC, then they must also be able to quantify their suspicion. They must be able to quantify how likely it is that the suspect uses a manipulated rather than an ideal die. In the simplest case, one compares two different models of the suspect: the null hypothesis (H0) is that the suspect uses an ideal die; the alternative hypothesis (H1) is that the suspect uses an additional manipulated die. H0 is discarded if H1 is much more likely. Of course, one can generate additional models of increasing complexity (H2, ..., Hn), e.g., to model the assumption that the suspect has used multiple manipulated dice with different parameters. One of those models must be selected such that the probability of excluding an innocent player from the world championship is minimized.
Just as graphs can be varied in their parameters, i.e., in the number of nodes and edges and thus in their topology (structure), different HMMs (H0, ..., Hn) can be generated and fitted to the same data, each under the assumption that it models reality. Each fitted HMM will have a certain likelihood. The likelihood here is the "probability" of the HMM given the data, and it can be interpreted as a kind of probability. There are also other measures that are analogous to the likelihood. The best known are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Given an unfitted HMM, there is virtually always exactly one fit of the (hidden) states to the data (or one behavior of the HMM automaton over time) which is the most likely. However, the other fitted HMMs might be similarly likely, even though the absolute value of their likelihood is worse compared to the best fit. In such a case, you have to decide which model to choose for further analysis to prevent overfitting.
The naive approach consists of simply choosing the HMM with the highest likelihood. However, this approach has the problem that a model with more parameters virtually ALWAYS has the higher likelihood. At the same time, however, the model loses generality (overfitting) or becomes more difficult to interpret. For the set of HMMs to be examined, this means that the corresponding model parameters, i.e., the number of states, should be chosen as "reasonably" as possible. However, even in this scenario, the choice of the HMM with the highest likelihood can be poor, since simpler models with fewer parameters, that is, HMMs with fewer states, may have "similarly" high likelihoods. In this situation, one should choose the HMM with the fewest states from the set of HMMs with "similarly" high likelihood (Ockham's Razor). How to do that in detail can be found here:
https://stats.stackexchange.com/questions/81427/aic-guidelines-in-model-selection.
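As an illustration, here is a minimal sketch of such a comparison in R, using the die data mseq from above (an assumption of this sketch; fitting on the long die sequence can take a while). depmixS4 provides logLik, AIC and BIC methods for fitted models:

library(depmixS4);
# fit HMMs with an increasing number of states to the same data
fits <- list();
for (n in 1:3) {
  set.seed(1);
  m <- depmix(MEAN ~ 1, family = gaussian(), nstates = n, data = mseq);
  fits[[n]] <- fit(m, verbose = FALSE);
}
# compare the fitted models; lower AIC/BIC is better
data.frame(nstates = 1:3,
           logLik = sapply(fits, function(f) as.numeric(logLik(f))),
           AIC = sapply(fits, AIC),
           BIC = sapply(fits, BIC));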
Even with new data, the likelihood of one and the same HMM can change, resulting in a completely different fit. The Expectation-Maximization algorithm (EM algorithm) is guaranteed to find a local maximum, but it does not necessarily find the global maximum (the sequence of calculations for an HMM fit: EM algorithm, then Viterbi algorithm). If one fits a model with different random seeds, one might end up in different local maxima. In this situation, the random seed can be interpreted as a new parameter, and this case can be treated as described in the last paragraph. (Source:
https://bitcointalk.org/index.php?topic=26136.msg45063876#msg45063876)
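A minimal sketch of such a multi-start fit, reusing the model definition from above (the fit with the highest log-likelihood is kept):

best <- NULL;
for (s in 1:10) {
  set.seed(s); # each seed gives the EM algorithm different starting values
  m <- depmix(MEAN ~ 1, family = gaussian(), nstates = 2, data = mseq);
  f <- try(fit(m, verbose = FALSE), silent = TRUE); # EM may fail to converge for some seeds
  if (!inherits(f, "try-error") && (is.null(best) || logLik(f) > logLik(best))) {
    best <- f; # keep the fit in the best local maximum found so far
  }
}
summary(best);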
Limits of a model

I do not want to talk much about the limitations of a model. However, I link a documentary where imho everything important is said (The Midas Formula). Although the video does not explicitly refer to the limits of models, the basic message should be clear in the end. It is important to watch the video in full:

https://vimeo.com/28554862

Hidden Markov Models for sentiment analysis

A price-return Hidden Markov Model

The return of a specific time interval t, e.g., of a specific week, is the difference between closing price and opening price divided by the closing price:
return[t] <- (x$CLOSE_PRICE[t] - x$OPEN_PRICE[t])/x$CLOSE_PRICE[t] # in R "<-" corresponds to ":=" or "=" in other programming languages
If the closing price is greater than the opening price, the return is positive; if the opening price is greater than the closing price, the return is negative; and if the closing price equals the opening price, the return is zero. The return of an asset that continuously varies around a certain average (e.g., like Tether around US$ 1), with no medium- to long-term swings down or up, can be modeled with a (normal) distribution of mean zero. For an asset where, on the other hand, there are additional market situations in which the price also falls or rises in the long term (e.g., bear and bull markets in Bitcoin), it makes sense to model these three market situations with at least three different distributions: the distribution of a bear market has a negative mean, the distribution of a bull market has a positive mean, and the distribution of a sideways movement has a mean of zero. Those three distributions are equivalent to three different dice at the DNGAWC: a manipulated die where the probability of throwing a six is less than 1/6, an ideal die, and a manipulated die where the probability of throwing a six is greater than 1/6.
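A minimal sketch of such a three-state return HMM; the returns here are simulated for illustration only, in practice one would use real weekly returns:

library(depmixS4);
set.seed(42);
# simulated weekly returns: bear (negative mean), sideways (zero mean), bull (positive mean)
ret <- c(rnorm(100, -0.05, 0.03), rnorm(100, 0.00, 0.02), rnorm(100, 0.05, 0.03));
df <- data.frame(RETURN = ret);
mod <- depmix(RETURN ~ 1, family = gaussian(), nstates = 3, data = df);
fm <- fit(mod, verbose = FALSE);
summary(fm); # the three state means should come out near -0.05, 0.0 and +0.05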
At https://github.com/trantute/sentiment-hmm, two exemplary HMMs are implemented. The first HMM models four states: bearish, sideways, bullish and bubble (see the plot below). The second HMM models five states: dead, bearish, sideways, bullish and bubble.
The states of the HMM over time are shown as a colored horizontal band on the x-axis in the plot above. The posterior probabilities as well as the logarithmized price (normalized to 1.0) and the logarithmized MA203 (normalized to 1.0) are shown as curves in the plot. The sentiment is indicated by color, as described above and as defined in the legend on the left side of the plot. The corresponding graph is given below:
Robustness analysis

The question arises how trustworthy the sentiment prediction for the current week actually is. Simulations with historical data have shown that the predictions are relatively stable if the market is not volatile in the medium to long term (sideways phase) or is in a medium- to long-term downward or upward trend (not shown). However, simulations with historical data as well as real data have also shown that an extreme change in price, which has a significant impact on the weekly return, can subsequently change the sentiment prediction. Below, the fits of seven consecutive weeks are shown:
The states of the HMM over time are shown as a colored horizontal band on the x-axis in the plot above. As can be seen on the right side of the plot, a state is unstable in particular if the HMM has changed its state only recently, or if different states for the current week have a similar likelihood. The posterior probabilities of the different states (the curves in the plot) can provide information about this. If all states have posterior probabilities not equal to 0.0 - as in the upper right plot - then the current state is more unstable than if one state has a posterior probability close to 1.0 and the others close to 0.0.
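A minimal sketch of such a stability check, assuming post_probs as returned by posterior() above:

p_last <- as.numeric(post_probs[nrow(post_probs), -1]); # posterior probabilities of the most recent time point
max(p_last); # close to 1.0 => stable prediction; spread over several states => unstable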
Other variants

The return is only one possible proxy for the sentiment in an HMM. Other variables can be used too, insofar as they correlate with the sentiment. E.g., the comments per time interval at bitcointalk.org could be used as a proxy. Corresponding code can also be found at
https://github.com/trantute/sentiment-hmm.
Trading by means of a Hidden Markov Model

If the sentiment is known, or if one is able to estimate the sentiment sufficiently accurately, then this knowledge can be used to act accordingly. The idea is to go long at the beginning of an uptrend and short at the beginning of a downtrend. For the plots shown above, this means, for example: bearish (red) => short; sideways (black) => long; bullish (green) => long; and bubble (violet) => long. In addition, the plot contains the logarithmized Bitcoin price (US$, Bitstamp, normalized to 1.0) for the corresponding time points (blue) and the MA203 (moving average over 203 days = 29 weeks, orange), which corresponds approximately to the MA200 but is easier to calculate on weekly data.
Of course, trading only makes sense if the sentiment changes from black, green or purple to red, or vice versa. For this you have to keep an eye on two consecutive weeks: suppose today is a Monday, yesterday was Sunday, and the HMM has predicted a change of state relative to the previous Sunday. Now you exchange your asset according to the rules mentioned above, namely at the opening price of the current week, i.e., the current Monday. (Source:
https://bitcointalk.org/index.php?topic=26136.msg43234811#msg43234811)
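A minimal sketch of this naive rule; the vector states is an assumption here and stands for the weekly state sequence of the fitted HMM, with 1 = bearish, 2 = sideways, 3 = bullish and 4 = bubble:

position <- ifelse(states == 1, "short", "long"); # bearish => short, everything else => long
change <- c(FALSE, position[-1] != position[-length(position)]); # week-over-week position changes
which(change); # the weeks at whose opening price one would exchange the asset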
Todo: Code
Todo: Test with historic data
The exchange conditions can be optimized further. The example from above is quite naive: 0:100, 100:0, 100:0 and 100:0. These are just four parameters. One could search the space of exchange ratios using historical data to find the ratios that maximize the yield. E.g., steps of 0.1 result in only 10000 possibilities, and one chooses the ratios which maximize the yield (in dollars or BTC). In particular, assuming that one can ONLY go long or short, i.e., all in or all out, the computational effort is manageable, since the exchange ratios would then be binary variables. With n = 4 states there would be only 2^4 = 16 possibilities. Such a method may also be useful for finding the best parameter set of the HMM, but with the risk that the model loses generality by overfitting on historical data. (Source:
https://bitcointalk.org/index.php?topic=26136.msg43256510#msg43256510)
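A minimal sketch of the binary variant, assuming ret contains the weekly returns and states the corresponding HMM states (1 to 4):

best_yield <- -Inf; best_rule <- NULL;
for (mask in 0:15) { # 2^4 = 16 possible long/short assignments
  rule <- as.integer(intToBits(mask))[1:4]; # per state: 1 = long, 0 = short
  pos <- ifelse(rule[states] == 1, 1, -1); # weekly position: +1 long, -1 short
  yield <- prod(1 + pos*ret); # cumulative multiplicative yield over all weeks
  if (yield > best_yield) { best_yield <- yield; best_rule <- rule; }
}
best_rule; best_yield;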
Price forecasting with a Markov Chain

Todo: Explanation, how to do this. See also https://github.com/trantute/sentiment-hmm for a 4-state HMM. Simulations with historical data are also needed.
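In the meantime, a minimal sketch of the basic idea, assuming P is the transition matrix of the fitted HMM (e.g., assembled from hmmfit@transition):

# simulate future hidden states by jumping along the edges of the Markov chain
simulate_states <- function(P, start, n) {
  s <- integer(n); s[1] <- start;
  for (i in 2:n) {
    s[i] <- sample(seq_len(nrow(P)), 1, prob = P[s[i-1], ]);
  }
  s;
}
# e.g., forecast <- simulate_states(P, 2, 52); # 52 weeks ahead, starting in state 2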
Discussion

It is important to note that the characteristics of the underlying probability distributions might change continuously over time (phase transition)! For instance, the return probability distribution of a bearish sentiment might be different in 2014 and 2018. As a result, over long time periods, one has to experiment with whether additional states, which model the same sentiment at different points in time but with different probability distributions, improve the power of the model.
Outlook

Todo: Ideas go here. E.g., price/return forecasting via a fitted HMM/Markov chain.
References

https://www.wikipedia.org/
https://api.bitcoincharts.com/v1/csv/
http://web.stanford.edu/class/stats366/exs/HMM1.html
https://www.quantstart.com/articles/hidden-markov-models-for-regime-detection-using-r
https://www.fool.com/knowledge-center/how-to-calculate-return-on-indices-in-a-stock-mark.aspx
https://stats.stackexchange.com/questions/81427/aic-guidelines-in-model-selection
https://github.com/trantute/sentiment-hmm