Algebra and insight

This post was originally published at notebook.drmaciver.com.

I was reading up on Markov Chains today, as one does, partially reminding myself about them and partly filling in some gaps in my knowledge.

One bit I was reading about was reversible distributions. If you’ve got some transition matrix (P_{ij}), a reversible distribution for it is a probability distribution (q) over the states such that (q_i P_{ij} = q_j P_{ji}). If you also lack any intuition about what that means, me too at the point at which it’s introduced. The broad intuition is that a Markov chain in a reversible distribution can be “run backwards in time” and you can’t tell the difference, but the main significance is that it’s an easy property to verify that has strong consequences.

Specifically, it’s a theorem that every reversible distribution is stationary. That is, if (q) is reversible then (q_i = j P{ji} q_j) for all (i) - the distribution is not changed by running the Markov chain forwards.

The proof of this presented in the text I was reading is as follows:

We can write (q_i = q_i j P{ij}) (because it’s a standard property of transition matrices that this sum is 1), so (q_i = j q_i P{ij} = j q_j P{ji}), so (q) is stationary.

I can follow this proof of course. It’s very simple algebra. But there’s something about it that feels like a magic trick. I don’t feel like I’ve got any actual insight into the nature of reversible distributions or have any sense of why they must be stationary from this. I’m sure there are plenty of mathematicians who could, and I’m not even sure that I need that insight, but it bugged me.

Thinking about it a bit more lead me to come up with the following alternative proof

One way to think about distributions on markov chains is that you’ve got a certain amount of “stuff” (probability mass) floating around, and at each step that you run the Markov chain, this probability mass flows around the graph.

When you’ve got a probability distribution (q), all of (q_i) flows out of node (i), and (j q_j P{ji}) flows into it. That is, for each node (j), the mass in it flows out to every other node, and the fraction of it that goes to (i) is (P_{ji}).

A distribution is stationary if for all (i), (q_i = j P{ji} q_j), but equally we can write this as (q_i - j P{ji} q_j = 0). The net flow of mass out of (i) is 0.

We can write this flow of mass out of (q_i) as (q_i = j q_i P{ij}) - the sum of the mass flowing to each (j). So the net flow of mass out of (i) is (j P{ij} q_i - P_{ji} q_j). That is, there is a net flow along each edge, let’s say (f_{ij} = P_{ij} q_i - P_{ji} q_j). The probability of (i) remains unchanged if (j f{ij} = 0).

However, the reversability condition is precisely that all of the (f_{ij} = 0). That is, there is no net flow of mass along each edge. As a result, a reversible distribution is necessarily stationary, because a much stronger condition applies: Not only is the total net flow across the edges from (i) equal to 0, it’s 0 along each edge.

This proof is in some sense much worse. It’s longer and involves more algebra rather than less. But after reading this proof I felt like I understood why reversibility obviously implied a distribution was stationary, and also why it was a much stronger condition than being stationary.

I think mathematics - and many other things - benefits hugely from this sort of chewing on a problem and seeing it from different angles until it actually makes sense. If you just know that a result is true, you often find yourself unable to really use it. If you understand at an intuitive level why it’s true, it will stick with you and you can really see how to apply it, including in cases where it doesn’t quite hold but something similar does. Proof without developing insight alongside it is certainly better than no proof, but it’s in some sense fundamentally fragile and untrustworthy, and it’s often better to mull it over until you find the right way to break up the problem to actually allow it to make sense at an emotional level.