Stumbling Through Markov Chains
Wherein I talk about random walks in the context of a place we’ve all been before.
A friend of yours shows up to your favorite local haunt, then one of their friends shows up, then an hour later another, and the next thing you know they’re drunk. The bartender is also a buddy, so he walks them out of the bar, points them in the direction of their apartment, which is on the left side of the street, and sets them going. They’re stumbly, and kinda weaving between the left sidewalk, the right sidewalk, and the middle of the road. And they think to themselves: “What is the probability that I end up in front of my apartment door at the end of this street?” Congratulations! They’ve just posed themselves a random walk Markov chain problem.
I already talked a little about Markov chain Monte Carlo modeling in my S&P 500 post, but Markov chains are interesting on their own and worth taking a look at.
The Mathematics of Markov Chains
Suppose a system can be in a finite number of states (yeah, I know my S&P 500 post used a continuum of states; the basic concepts are the same, but a finite state space is easier to work with), and the system randomly hops from state to state. That’s basically a random walk. If the probability of the new state is purely a function of the current state, that’s a random walk Markov chain.
More formally, given a state $s$, the probability of transitioning to the next state $s'$ is purely a function of $s$, i.e. $P(s' | s)$, and not the previous states. If we have a finite number of states, then we can represent the probability of transitions as the transition matrix $T_{s' s} = P(s' | s)$. Most of our analysis will center on the transition matrix.
The transition matrix
We know that the system has to be in some state, therefore we know that $\sum_{s'} P(s' | s) = 1$. We also know that $P(s') = \sum_s P(s' | s) P(s)$, therefore given a probability vector $\vec{p}$ where each component is the probability of being in a state, we have that $\vec{p}' = T \vec{p}$. From this, we can ask what the probability is of being in state $s''$ after two steps if we started in state $s$. This is just the sum over the probabilities that we’d pass through each intermediate state $s'$:
$$P(s'' | s) = \sum_{s'} P(s'' | s') \times P(s' | s) = \sum_{s'} T_{s'' s'} T_{s' s}$$
which is just the $(s'', s)$ element of $T^2$. Therefore, given an initial probability distribution column vector $\vec{p}_0$, the distribution after $n$ steps is just $\vec{p}_n = T^n \vec{p}_0$. This tells us how to propagate an initial distribution in a random walk.
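To make the conventions concrete, here’s a minimal numpy sketch of that propagation, using a made-up two-state transition matrix (not anything from the problem below). Because $T_{s' s} = P(s' | s)$, the columns of $T$ sum to one, and $T$ acts on a column probability vector from the left.

```python
import numpy as np

# Made-up two-state chain: columns are "from" states and each column sums to 1,
# matching the convention T[s', s] = P(s' | s).
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

p0 = np.array([1.0, 0.0])  # start in state 0 with certainty

# Distribution after n steps: p_n = T^n p_0
n = 10
p_n = np.linalg.matrix_power(T, n) @ p0
print(p_n)  # still a probability vector: non-negative entries summing to 1
```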
We also know that $0 \leq P(s' | s) \leq 1$ always (because what’s a negative probability?), so we know the elements of the transition matrix satisfy that inequality. It can be shown, using the Gershgorin circle theorem, that this means the magnitudes of the eigenvalues are bounded above by $1$. In fact, the row vector $\vec{1}$ is trivially a left eigenvector of $T$ with eigenvalue $1$, since $\sum_{s'} 1_{s'} T_{s' s} = \sum_{s'} T_{s' s} = 1 = 1_s$, and because $T$ and its transpose share eigenvalues, we know there is some column (right) eigenvector with eigenvalue $1$, viz.
$$\vec{\pi} = T\vec{\pi}$$
and $\vec{\pi}$ is the stationary distribution.
The stationary distribution
The stationary distribution is the right eigenvector with eigenvalue exactly $1$. We know it exists, and (setting aside a caveat we’ll get to below) all the other eigenvalues satisfy $|\lambda| < 1$. It turns out that any initial distribution will then converge to the stationary distribution. This is easy enough to understand.
We can decompose any initial distribution vector as a sum of eigenvectors, so that
$$\vec{p}_0 = \sum_i a_i \vec{v}_i$$
for column eigenvectors $\vec{v}_i$. Upon repeated application of the transition matrix we have that
$$\vec{p}_n = T^n \sum_i a_i \vec{v}_i = \sum_i a_i \lambda_i^n \vec{v}_i$$
for each $\lambda_i$. Because we’ve already shown that they have magnitude less than one, $\lambda_i^n \rightarrow 0$ for all the eigenvectors except the stationary distribution. If we wait long enough, a Markov chain converges to its stationary distribution. We can also estimate how long this takes by looking at the second-largest eigenvalue magnitude. Specifically, the system will be in the stationary distribution plus some $\varepsilon$ where $\varepsilon \sim |\lambda_2|^n$ for the second-largest eigenvalue $\lambda_2$. If $|\lambda_2| = 0.99$, we have to wait much longer for the system to converge to its stationary distribution than if $|\lambda_2| = 0.1$.
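As a rough sketch of that estimate (reusing the made-up two-state matrix from the earlier snippet, not the drunk-walk chain below), we can read off the second-largest eigenvalue magnitude and ask how many steps it takes for $|\lambda_2|^n$ to drop below some tolerance:

```python
import numpy as np

# Made-up two-state chain from the earlier sketch.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

eigvals = np.linalg.eigvals(T)
lambda_2 = sorted(np.abs(eigvals))[-2]  # second-largest eigenvalue magnitude

# How many steps until the correction lambda_2**n falls below tol?
tol = 1e-3
n_steps = np.ceil(np.log(tol) / np.log(lambda_2))
print(lambda_2, n_steps)
```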
It’s worth noting here that just because a Markov chain has a stationary distribution, that does not mean it will actually settle into that distribution. For example, consider a system that flips from state A to state B with probability $1$, and then from state B back to state A with probability $1$. In this case, the transition matrix is
$$T = \left ( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right )$$
which has two eigenvalues, $\pm 1$. The second eigenvalue, $-1$, has magnitude $1$, so $(-1)^n$ never goes to zero, violating the assumption we made above. Therefore, while $\vec{\pi} = (1/2, 1/2)$ is a stationary distribution, no other initial distribution will converge to it. However, again from the Gershgorin circle theorem, if the diagonal elements of the transition matrix are non-zero (the system always has some chance of staying put), then every eigenvalue other than $1$ has magnitude $|\lambda| < 1$ and we’ll be able to converge to the stationary distribution.
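A quick numerical check of that non-convergence, using exactly the flip-flop matrix above:

```python
import numpy as np

# The flip-flop chain: eigenvalues are +1 and -1, so a distribution that
# starts anywhere other than (1/2, 1/2) oscillates forever.
T = np.array([[0.0, 1.0],
              [1.0, 0.0]])

p = np.array([1.0, 0.0])  # start in state A
for k in range(6):
    print(k, p)  # alternates between (1, 0) and (0, 1), never settling
    p = T @ p
```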
Our Drunken Companion
Let’s get back to our original problem: our friend who starts out on the left side of the street and needs to finish on the left side, in front of their apartment. Let’s say that our friend really likes to stay on one of the sidewalks, but sometimes stumbles into the street and then ends up back on one of the sidewalks at random.
Formalizing the problem
Let’s say that once they’re on the left sidewalk, they know that’s where they want to be, so the probability they stay there is 80% (and they stumble into the street the other 20% of the time). If they end up in the street, they just want out of it, but indecision means they pick their next position (left sidewalk, street, or right sidewalk) uniformly at random. Meanwhile, if they’re on the right sidewalk, they really want to get to the left side, so the probability they step off into the street is 80%, and they stay put the other 20%. They never cross directly from one sidewalk to the other in a single step.
We can then fill in our full transition matrix, with the states ordered (left sidewalk, street, right sidewalk), as
$$T = \left ( \begin{array}{ccc} 4/5 & 1/3 & 0 \\ 1/5 & 1/3 & 4/5 \\ 0 & 1/3 & 1/5 \end{array} \right )$$
with the initial condition that we know they’re starting out on the left sidewalk,
$$\vec{p}_0 = \left ( \begin{array}{c} 1 \\ 0 \\ 0 \end{array} \right )$$
We don’t know yet how long the street is, so it would be good to look at both the long-term probability and the step-by-step probability of being on the left sidewalk for walks shorter than “a long time”.
Equilibrium probability
Given enough steps, the probability converges to the equilibrium, for the reasons described above. So, to find this probability, we need to find the eigenvector that corresponds to eigenvalue $1$, which can be done using the numpy.linalg library, and we get the long-term probabilities $P_{\textsf{l}} = 0.541$ (left sidewalk), $P_{\textsf{r}} = 0.135$ (right sidewalk), and $P_{\textsf{s}} = 0.324$ (street).
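Here’s a minimal sketch of one way to do that with numpy.linalg.eig: pick out the eigenvalue closest to $1$ and normalize its eigenvector so the components sum to one.

```python
import numpy as np

# Drunk-walk transition matrix, states ordered (left, street, right).
T = np.array([[4/5, 1/3, 0],
              [1/5, 1/3, 4/5],
              [0,   1/3, 1/5]])

eigvals, eigvecs = np.linalg.eig(T)
idx = np.argmin(np.abs(eigvals - 1))  # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()  # rescale the eigenvector into a probability distribution

print(pi)  # roughly [0.541, 0.324, 0.135] for (left, street, right)
```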
Reaching Equilibrium
We aren’t necessarily interested in the equilibrium distribution if the walk isn’t long enough to reach it, so the probability our friend ends up on the left side, given that they started on the left side, will depend on how far their walk is. We can track the probabilities of all three states through repeated application of the transition matrix, and in particular our friend has better odds of finishing on the left sidewalk if the walk is shorter. The probability of being in each state at step $k$ is given by the transition matrix times the probability distribution at step $k-1$, so we can watch the distribution evolve as
$$\vec{p}_k = T \vec{p}_{k-1}$$
and therefore that
$$\vec{p}_k = T^k \vec{p}_0$$
which, when we iterate it numerically, gives the evolution below.
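Here’s a minimal sketch of that propagation, printing the distribution at each step rather than plotting it:

```python
import numpy as np

# Drunk-walk transition matrix, states ordered (left, street, right).
T = np.array([[4/5, 1/3, 0],
              [1/5, 1/3, 4/5],
              [0,   1/3, 1/5]])

p = np.array([1.0, 0.0, 0.0])  # they start on the left sidewalk

for k in range(8):
    print(f"step {k}: left={p[0]:.3f}  street={p[1]:.3f}  right={p[2]:.3f}")
    p = T @ p
# P(left) falls from 1.0 toward the equilibrium value of roughly 0.54
```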
The three eigenvalues for this system are $1$, $0.62$, and $-0.28$, so we can estimate the correction to the equilibrium distribution as scaling like $0.62^k$; by step 4 that’s $0.62^4 \approx 0.15$, or roughly a 15% correction still remaining.
Finishing Remarks
We’ve looked at three states, and talked about finite state spaces and matrix representations. However, this all works in a continuum of states, with the structure of the evolution of the distribution coming from the Chapman-Kolmogorov equation, which has a similar structure to iterated Green’s function integrals, which is how we end up with path integrals in quantum mechanics. This means we can map the evolution of a continuum of states (or even finite states for finite-state quantum systems) in a Markov chain to an imaginary-time Schrödinger-equation-like thing. So all that time I spent taking field theory in grad school has adequately prepared me for infinite-dimensional Markov chains.