My favourite programming language, Haskell, has a nice library for doing calculations with probability distributions, called probability.
The library was originally written by Martin Erwig and Steve Kollmansberger and then extended and packaged for convenient installation by Henning Thielemann. Its main idea is to use the fact that “probability distributions form a monad” (from a paper by the authors of the library). This post aims at translating this claim into standard probability notation and showing how it is used in the library.

A monad is a notion in category theory, defined as “a monoid in the category of endofunctors”. That sounds so cool that I really wish I knew what it meant. The only thing I know now is that the “monoid” here is not the standard monoid from mathematics founded on set theory, but rather its category theory analogue. So what does the claim “probability distributions form a monad” mean Haskell-wise?

Let’s load the library and play with it.

Prelude> :m +Numeric.Probability.Distribution

In the probability library distributions are (essentially) represented as lists of pairs (element, probability).
There are a couple of functions that provide means of generating some often-used distributions. For example, “choose p a b” generates a distribution that assigns probability p to the value a and (1-p) to the value b.

Prelude Numeric.Probability.Distribution> choose 0.3 0 1

fromFreqs [(1,0.7),(0,0.3)]

Another example is the function “uniform” that assigns the same probability to all elements of a list.

Prelude Numeric.Probability.Distribution> uniform [0,1,2,3]

fromFreqs [(0,0.25),(1,0.25),(2,0.25),(3,0.25)]

There is also a function “certainly” that constructs a point mass:

Prelude Numeric.Probability.Distribution> certainly 5

fromFreqs [(5,1)]

Now what about that monad thing? Let’s think about distributions on a given finite set $X$ as nonnegative functions that sum up to 1 and let $Dist(X)$ be the set of such distributions. So, we have $Dist(X) = \{ P: X \rightarrow [0,1] | \sum_{x\in X} P(x) = 1\}$. Suppose now we have two finite sets $A,B$. We can define a binary operation $\leadsto:(Dist(A) \times (A \rightarrow Dist(B))) \rightarrow Dist(B)$ that takes a distribution on $A$ and a function that assigns a distribution on $B$ to every element of $A$, and creates a distribution on $B$ in a (quite) natural way:

$(P \leadsto f)(b) = \sum_{a\in A} P(a)\cdot (f(a))(b)$ for any $P\in Dist(A)$ and $f: A\rightarrow Dist(B)$.

By taking the sum over all $b \in B$ it is easy to check that $P \leadsto f$ is indeed a distribution on $B$. It can be interpreted as the distribution of the result of first selecting $a \in A$ according to the distribution $P$ and then selecting from the set $B$ according to the distribution $f(a)$.
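To make the definition concrete, here is a minimal self-contained Haskell sketch of the $\leadsto$ operation, with a distribution represented as a list of (element, probability) pairs, as in the library. The names `Dist` and `bindD` are mine, not the library's, and unlike the library this sketch does not merge duplicate outcomes.

```haskell
-- A distribution on a, as a list of (element, probability) pairs.
type Dist a = [(a, Double)]

-- The operation written "⤳" above: combine a distribution on a with a
-- family of distributions on b (one for each a) into a distribution on b.
bindD :: Dist a -> (a -> Dist b) -> Dist b
bindD p f = [ (b, pa * pb) | (a, pa) <- p, (b, pb) <- f a ]

main :: IO ()
main = print (bindD [(0 :: Int, 0.3), (1, 0.7)]
                    (\x -> [((x, 0 :: Int), 0.5), ((x, 1), 0.5)]))
```

Running it prints the joint distribution [((0,0),0.15),((0,1),0.15),((1,0),0.35),((1,1),0.35)], i.e. each pair gets the product of the two stage probabilities.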

We can use the “$\leadsto$” operation (corresponding to Haskell’s “>>=”) to construct joint distributions. For example, to generate a model of two coin tosses we can do

Prelude Numeric.Probability.Distribution>
(choose 0.5 0 1) >>= (\x -> choose 0.5 (x,0) (x,1))

fromFreqs [((0,0),0.25),((0,1),0.25),((1,0),0.25),((1,1),0.25)]

Now suppose we have three finite sets $A,B,C$ and corresponding operations in

$(Dist(A) \times (A \rightarrow Dist(B))) \rightarrow Dist(B)$

and

$(Dist(B) \times (B \rightarrow Dist(C))) \rightarrow Dist(C)$

that we will both denote $\leadsto$. Then for all $P \in Dist(A)$, $f: A\rightarrow Dist(B)$ and $g: B\rightarrow Dist(C)$ we have the identity $((P \leadsto f) \leadsto g) = P \leadsto h$, where $h: A \rightarrow Dist(C)$ is defined as $h(a) = f(a) \leadsto g,\ a\in A$.
Note that if $a \in A$ then $h(a)$ is a distribution on $C$ and $(h(a)) (c) = \sum_{b\in B} (f(a))(b)\cdot (g(b))(c)$ for $c \in C$.

Let’s expand both sides to see that the identity really holds. Fix $c \in C$. We want to see that the values of the distributions $(P \leadsto f) \leadsto g$ and $P \leadsto h$ are the same on $c$.

The left hand side evaluates to

$\sum_{b\in B} (P \leadsto f)(b) \cdot (g(b))(c) = \sum_{b\in B} \sum_{a\in A} P(a) \cdot (f(a))(b)\cdot (g(b))(c)$

and on the right hand side we have

$\sum_{a\in A} P(a)\cdot (h(a))(c) = \sum_{a\in A} P(a)\cdot \sum_{b\in B} (f(a))(b)\cdot (g(b))(c)$.

So they are the same.
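The identity can also be checked numerically. Below is a small self-contained Haskell sketch (the names `Dist`, `bindD` and `norm` are mine, not the library's) that compares both sides of the law on a concrete example, after sorting and merging duplicate outcomes so the two lists can be compared directly.

```haskell
import Data.List (sort)

type Dist a = [(a, Double)]

bindD :: Dist a -> (a -> Dist b) -> Dist b
bindD p f = [ (b, pa * pb) | (a, pa) <- p, (b, pb) <- f a ]

-- Sort and merge duplicate outcomes so the two sides compare as equal lists.
norm :: Ord a => Dist a -> Dist a
norm = merge . sort
  where
    merge ((x, p) : (y, q) : rest)
      | x == y = merge ((x, p + q) : rest)
    merge (xp : rest) = xp : merge rest
    merge [] = []

main :: IO ()
main = do
  let p = [(0, 0.5), (1, 0.5)] :: Dist Int
      f x = [(x, 0.5), (x + 1, 0.5)]
      g y = [(even y, 1.0)]
  -- ((P ⤳ f) ⤳ g)  versus  P ⤳ (\a -> f a ⤳ g)
  print (norm (bindD (bindD p f) g) == norm (bindD p (\a -> bindD (f a) g)))
```

This prints True; the example uses probabilities that are exact in binary floating point, so the comparison is safe here.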

The fact that we have this quasi-associativity (plus two other conditions that are trivial to check) allows us to claim that the type of distributions forms a monad. In Haskell this opens up a whole world of possibilities. There are hundreds of functions related to monads in Haskell libraries. One can assume that each of them does something useful when applied to probabilistic calculations, and the only problem is to figure out what that useful thing is.
For example the function “replicateM” creates distributions of sequences of independent identically distributed random variables:

Prelude Numeric.Probability.Distribution> :m + Control.Monad
replicateM 4 (choose 0.5 0 1)

fromFreqs [([0,0,0,0],6.25e-2),([0,0,0,1],6.25e-2),([0,0,1,0],6.25e-2),
([0,0,1,1],6.25e-2),([0,1,0,0],6.25e-2),([0,1,0,1],6.25e-2),([0,1,1,0],
6.25e-2),([0,1,1,1],6.25e-2),([1,0,0,0],6.25e-2),([1,0,0,1],6.25e-2),
([1,0,1,0],6.25e-2),([1,0,1,1],6.25e-2),([1,1,0,0],6.25e-2),
([1,1,0,1],6.25e-2),([1,1,1,0],6.25e-2),([1,1,1,1],6.25e-2)]
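Since replicateM works for any monad, the same computation can be reproduced with a minimal hand-rolled distribution monad. In the sketch below all names (`Dist`, `prob`, `coin`) are mine, not the library's (the library has its own query operator for events); it asks for the probability of seeing at least three 1s in four tosses.

```haskell
import Control.Monad (replicateM)

-- A minimal distribution monad over (element, probability) pairs.
newtype Dist a = Dist { runDist :: [(a, Double)] }

instance Functor Dist where
  fmap f (Dist d) = Dist [ (f x, p) | (x, p) <- d ]

instance Applicative Dist where
  pure x = Dist [(x, 1.0)]
  Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

instance Monad Dist where
  Dist d >>= f = Dist [ (y, p * q) | (x, p) <- d, (y, q) <- runDist (f x) ]

-- Probability of an event: sum the weights of outcomes satisfying it.
prob :: (a -> Bool) -> Dist a -> Double
prob ev (Dist d) = sum [ p | (x, p) <- d, ev x ]

coin :: Dist Int
coin = Dist [(0, 0.5), (1, 0.5)]

main :: IO ()
main = print (prob ((>= 3) . sum) (replicateM 4 coin))  -- 5/16 = 0.3125
```

The five favourable sequences (four with three 1s, one with four) each carry weight 1/16, giving 0.3125.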

One can define a function that calculates the sequence of expectations of the maximum of $1,2,\dots,n$ i.i.d. random variables with common distribution $d$ as follows:

import qualified Numeric.Probability.Distribution as D
import Control.Monad (replicateM)

expmaxseq n d = [D.expected $ D.map maximum $ replicateM k d | k <- [1..n]]

For throwing dice we get the following expectations of maximum for 1,2,…, 7 dice:

ProbLib Numeric.Probability.Distribution> expmaxseq 7 (uniform [1..6])

[3.5,4.472222222222223,4.958333333333335,5.2445987654321105,
5.4309413580246915,5.560292352536542,5.65411736969124]


1. wren ng thornton Says:

Actually, that’s exactly the “monoid” for which monads (oid != ad) are monoids on the category of endofunctors.

As for what it means, start by considering the set containing all functions (ion != or) that take some type, T, back into itself. This looping back is all that “endo” means. Now, there are two interesting things with this set of functions: we can compose two functions in this set and get another one, and there’s a unique identity function in the set. Function composition is an associative binop since (f . g) . h == f . (g . h). And the identity function is an identity element since f . id == id . f == f. Thus this set, with the appropriate composition operator and distinguished identity function, forms a monoid in the traditional algebraic sense.
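[The endofunction monoid described above is available in Haskell's standard library as `Endo` from `Data.Monoid`, where `<>` is composition and `mempty` is the identity function. A quick illustration:]

```haskell
import Data.Monoid (Endo(..))

main :: IO ()
main = do
  let f = Endo (+ 1)
      g = Endo (* 2)
  -- (<>) is function composition: ((+1) . (*2)) 10 = 21
  print (appEndo (f <> g) 10)
  -- mempty is id, so composing with it changes nothing
  print (appEndo (f <> mempty) 10 == appEndo f 10)
```

This prints 21 and True, showing composition and one of the identity laws on concrete inputs.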

A functor is a generalization on a function. Any intro on category theory will give the basic details, so I’ll omit them here. Once we generalize the above so that the “set” is a category and each endofunction is an endofunctor, a monad is what we get by choosing different definitions of “composition” for the above example and select the appropriate identity element for that composition. [Note that composition of morphisms in the CoE and the identity morphisms of the CoE, should not be confused with the composition and identity element that form the monoid.] In Haskell terms, for each monad the identity element is return and the binop is join (which can be taken as a primitive instead of >>=).
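[For the distribution monad specifically, join has a direct probabilistic reading: it flattens a distribution over distributions by weighting each inner outcome with its outer probability. A minimal sketch, where `Dist` and `joinD` are illustrative names rather than the library's:]

```haskell
type Dist a = [(a, Double)]

-- join for distributions: collapse a distribution over distributions into
-- one distribution by multiplying outer and inner probabilities.
joinD :: Dist (Dist a) -> Dist a
joinD dd = [ (x, pOuter * pInner) | (inner, pOuter) <- dd, (x, pInner) <- inner ]

main :: IO ()
main = print (joinD [ ([(0 :: Int, 1.0)], 0.5)
                    , ([(0, 0.5), (1, 0.5)], 0.5) ])
```

Here a fair choice between a point mass at 0 and a fair coin flattens to [(0,0.5),(0,0.25),(1,0.25)], i.e. 0 with total probability 0.75.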

For more, see this post by Dan Piponi.

2. slawekk Says:

Thanks for the explanation. Yes of course a set $X\rightarrow Y$ with composition forms a monoid. The problem is that if you say

start by considering the set containing all functions (ion != or) that take some type

for someone living in ZFC universe the reading stops right here. The set theory does not have the notion of type.

This is the problem with all expositions of category theory I have seen. They use category theory terminology to explain category theory. That would be OK if category theory were really foundational. We would have axioms, preferably implemented in some formal proof language, so that we could play with them, prove some theorems and work on the intuition reading some accompanying informal text. No such thing exists to my knowledge. It is possible to talk about category theory inside ZF plus an additional axiom (Tarski–Grothendieck set theory), but I don’t know about any informal text that presents category theory from this point of view.

3. wren ng thornton Says:

You don’t really need a notion of types. In ZFC (or even CT) all you need to say is that you have a set (or category) of thingees. As far as the theory goes, we don’t care what a thingee really is. The only time it matters is when we want to ascribe a certain interpretation to the set/category, such as when we want to prove that it does indeed have the structure we’re imputing to it.

But this is run of the mill stuff in ZFC. Consider how one would define a relation on a set. Theoretically we can describe relations in ZFC terminology, and we can certainly discuss the properties of one if we have it, but in order to actually have a real live relation on a real live set we need to appeal to some oracle which knows what thingees actually are (unless we’re dealing with simple relations that never inspect their arguments). Only these implementation oracles need to know about “types”, but even then we can usually get away without it. In my example we don’t actually have to know what a type is, we only need to know enough to be able to discuss that a thingee happens to be a function (from some carrier back to the same). Set theory has functions, it just describes them as being from one set into another. Functions can be elements in a set or a model just like anything else, nothing to see here.

Indeed, category theory is foundational just as much as ZFC is. It’s just that when CT discusses “sets with additional structure” it’s interested in that structure rather than in what’s beneath it. Given the focus on structure-preserving transformations, what’s actually beneath frequently doesn’t matter anyhow. For someone steeped in set theory it feels like everything is turned on its ear because set theory is so very concrete and so very interested in the little elements rather than in the big structure. But it’s in focusing on the big structure irrespective of the particular elements where all the revelations of CT come from.

As far as proof libraries go, I’m not familiar enough with the libraries for theorem provers to offer any suggestions. However, there are things like the category-extras package in Haskell which implements a great deal of theory, including features like passing around proofs of properties on a type. Perhaps that’s too far across the Curry–Howard isomorphism for what you’re interested in, but it’s out there. (Much of the attendant documentation was on the developer’s blog which crashed a while back.) There’s also various work on generating free theorems, which is more type theory than category theory proper, though there’re deep connections between them.

4. daveshields Says:

While Haskell is powerful, SETL, which has finite sets as the fundamental data type, is of course more powerful.

I’m writing a new SETL compiler now. When done, it will have even more expressive power than Haskell.

It will also, I promise, be *much* faster.

http://daveshields.wordpress.com

See especially the post on Moore’s Law and Software. All the rest is detail.

thanks,dave

• slawekk Says:

Very interesting. I was always wondering why there is no programming language based on set theory and it turns out there is one! Please let me know when the compiler is ready for beta testing.

• anon Says:

From what I can tell, SETL is a dynamically typed language with sets and tuples as its basic data types. It may be more powerful than Haskell in some sense, but we already tried that kind of thing with LISP and look what happened…

In my view, the best candidates for “very high level” languages with a mathematical foundation are dependently typed languages–such as Coq, Agda and Epigram. These are conceptually very similar to Haskell, and Coq programs can be trivially translated to Haskell or ML code.

5. Markov chains « Formalized Mathematics Says:

[…] Formalized Mathematics Just another WordPress.com weblog « Probability monad […]