This is an appendix with some technical detail for the article *Tails, Bayes loses (and invents data assimilation in the process)*. You'll need some knowledge of probability theory and statistics to understand it.

### Technical explanation

In trying to use data which has errors in it to estimate an event, we want to choose the event which has the maximum chance of happening given the data that we have observed. This is often called the *maximum likelihood estimate* or MLE.

We can do this using Bayes' theorem. We already learned that if the probability of an event $E$ happening is $P(E)$, and the probability of the data $D$ being observed given that the event has happened is $P(D|E)$, then the probability of that event happening, given the observation of the data, is given by Bayes' theorem and is

$$P(E|D) = \frac{P(D|E)\,P(E)}{P(D)}.$$

Here $P(D)$ is the probability of recording the data from the system we are studying, which can occur whether or not we see the event.

To find the MLE we want to choose the estimate of the event $E$ which makes $P(E|D)$ as large as possible. Since $P(D)$ does not depend on $E$, it follows from Bayes' theorem that we want to find the event for which the *Bayesian estimate* of the MLE given by

$$P(D|E)\,P(E)$$

is as large as possible.
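As a quick sanity check of this point, here is a small Python sketch with made-up numbers (two hypothetical candidate events, "warm" and "cold", with invented priors and likelihoods): because $P(D)$ is the same for every candidate event, maximising the full posterior $P(E|D)$ picks out the same event as maximising $P(D|E)\,P(E)$.

```python
# Hypothetical discrete example (all numbers are illustrative): two
# candidate events with prior probabilities and likelihoods of the data.
priors = {"warm": 0.7, "cold": 0.3}        # P(event)
likelihoods = {"warm": 0.2, "cold": 0.9}   # P(D | event)

# P(D): total probability of observing the data, summed over events
p_data = sum(likelihoods[e] * priors[e] for e in priors)

# Full posterior P(event | D) via Bayes' theorem
posterior = {e: likelihoods[e] * priors[e] / p_data for e in priors}

# Maximising the posterior gives the same event as maximising
# P(D | event) * P(event), because P(D) is a common factor.
best_full = max(posterior, key=posterior.get)
best_unnormalised = max(priors, key=lambda e: likelihoods[e] * priors[e])

print(best_full, best_unnormalised, posterior)
```

Here the data favour "cold" strongly enough to overcome the prior, and both maximisations agree, as they must.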

Now let’s suppose that the event $E$ is a one-dimensional variable (so it takes on values that are single numbers, such as temperature) which has a mean value $X$, which is our prior estimate for $E$, and that its errors have variance $\sigma_p^2$. We also suppose that the measured data $D$ is a one-dimensional variable which has mean $E$ (so that the data is an unbiased estimate of the event value) and its errors have variance $\sigma_d^2$.

We now consider the special, but very common, case which arises when both the data errors and the errors in the event are independent *Gaussian random variables* (also called Normal random variables). Most realistic examples are like this. In this case we can estimate $P(E)$ and $P(D|E)$ by the expressions

$$P(E) \propto \exp\left(-\frac{(E-X)^2}{2\sigma_p^2}\right)$$

and

$$P(D|E) \propto \exp\left(-\frac{(D-E)^2}{2\sigma_d^2}\right).$$

It then follows from the formula for the Bayesian estimate above that

$$P(D|E)\,P(E) \propto e^{-J(E)},$$

where

$$J(E) = \frac{(E-X)^2}{2\sigma_p^2} + \frac{(D-E)^2}{2\sigma_d^2}.$$

We want to make $P(D|E)\,P(E)$ as large as possible. We do this by finding the value of $E$ which makes $J(E)$ as small as possible. The minimiser of $J(E)$ is then our MLE of the value of $E$.

For this one-dimensional example, where $E$ is a single variable, we can find the minimum of $J(E)$ exactly. In particular we have

$$E = \frac{\sigma_d^2\,X + \sigma_p^2\,D}{\sigma_p^2 + \sigma_d^2}$$

as the best estimate for $E$.
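This closed-form minimiser is easy to check numerically. The Python sketch below uses illustrative numbers (a prior estimate of 20, a measurement of 14, and invented variances, none taken from the article): it minimises $J(E)$ on a fine grid and compares the result with the formula above.

```python
import numpy as np

# Illustrative numbers (not from the article): prior estimate X,
# measured data D, and the two error variances.
X, D = 20.0, 14.0
var_p, var_d = 1.0, 4.0

# J(E) = (E - X)^2 / (2 var_p) + (D - E)^2 / (2 var_d)
E_grid = np.linspace(0.0, 40.0, 400001)
J = (E_grid - X) ** 2 / (2 * var_p) + (D - E_grid) ** 2 / (2 * var_d)
E_numeric = E_grid[np.argmin(J)]

# Closed-form minimiser from the text
E_closed = (var_d * X + var_p * D) / (var_p + var_d)

print(E_numeric, E_closed)
```

With these numbers the data variance is four times the prior variance, so the estimate sits much closer to the prior (18.8) than to the measurement, just as the formula predicts.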

See the following example for how this works in the case of the thermometer problem and how much we need to nudge our estimates by.

### An explicit example

Suppose that we want to predict the temperature of a room. We write $T$ for the true temperature of the room.

Using our knowledge of the weather over the last few days we make an unbiased prior prediction of what we think the temperature should be. (As an example, a simple prediction of tomorrow’s weather is given by today’s weather. This prediction is right 70% of the time!) The (prior) prediction $X$ of the temperature has an error $\epsilon = X - T$. The error has variance $\sigma_p^2$.

We then look at the thermometer and it is recording a temperature of $D$. The thermometer measurement says that the room is cold. We suspect that it may be wrong because everyone in the room is dressed in summer clothing and is fanning themselves to keep cool.

The data error $\delta = D - T$ has variance $\sigma_d^2$, which from the above considerations is likely to be large. We will assume that the prediction error $\epsilon$ and the data error $\delta$ are independent random variables, each of which follows the normal distribution.

To nudge the prediction in the direction of the data we construct a new measurement, the analysis $A$, given by

$$A = \alpha X + (1-\alpha) D.$$

The *nudging parameter* $\alpha$ controls how much we nudge the prediction in the direction of the data, and we want to choose $\alpha$ so that the error in the analysis has as small a variance as possible. This error is given by $A - T$, and a little algebra shows us that

$$A - T = \alpha(X - T) + (1-\alpha)(D - T) = \alpha\epsilon + (1-\alpha)\delta.$$

We can see that this is made up of the prediction error $\epsilon$ and the data error $\delta$, which we have assumed to be independent random variables. A standard result in probability theory then states that the variance $\sigma_a^2$ of

$$A - T$$

is given by

$$\sigma_a^2 = \alpha^2\sigma_p^2 + (1-\alpha)^2\sigma_d^2.$$

We then want to find the value of $\alpha$ which minimises $\sigma_a^2$. If you know a bit of calculus, you’ll know that you can do this by differentiating $\sigma_a^2$ with respect to $\alpha$ and setting the result to zero. This gives the optimal value of $\alpha$ as

$$\alpha = \frac{\sigma_d^2}{\sigma_p^2 + \sigma_d^2}$$

and

$$A = \frac{\sigma_d^2\,X + \sigma_p^2\,D}{\sigma_p^2 + \sigma_d^2}.$$

This is our best estimate of the temperature of the room, which nicely combines the prediction and the data into one formula. As a sanity check, if the data error variance $\sigma_d^2$ is large compared to the prediction error variance $\sigma_p^2$ (as we suspect from looking at the room occupants), then $\alpha$ will be close to one, and we place much more reliance on the prediction than on the data. Also, if the data error variance and the prediction error variance are the same (so that the prediction is as good an estimate as the data) then we have

$$A = \frac{X + D}{2},$$

which looks very reasonable.
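The whole argument can also be checked by simulation. The Python sketch below uses invented numbers (a true temperature of 22 and made-up variances, not from the article): it draws many unbiased predictions and noisy thermometer readings, forms the analysis with the optimal nudging parameter from the formula above, and confirms that its error variance beats trusting either the prediction or the data alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers (not from the article): true temperature T,
# prediction error variance var_p, data error variance var_d.
T = 22.0
var_p, var_d = 1.0, 9.0
n = 200_000

# Many unbiased predictions X and noisy thermometer readings D
X = T + rng.normal(0.0, np.sqrt(var_p), size=n)
D = T + rng.normal(0.0, np.sqrt(var_d), size=n)

def analysis_error_variance(alpha):
    # Sample variance of A - T for the analysis A = alpha*X + (1-alpha)*D
    A = alpha * X + (1 - alpha) * D
    return np.var(A - T)

alpha_opt = var_d / (var_p + var_d)   # optimal nudging parameter: 0.9 here
var_opt = analysis_error_variance(alpha_opt)

# Compare against the two extremes: trusting only the prediction
# (alpha = 1) or only the data (alpha = 0).
print(alpha_opt, var_opt,
      analysis_error_variance(1.0), analysis_error_variance(0.0))
```

Note that the optimal analysis error variance, $\sigma_p^2\sigma_d^2/(\sigma_p^2+\sigma_d^2)$, is smaller than both individual variances: combining the two sources of information always helps.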

Back to the article *Tails, Bayes loses (and invents data assimilation in the process).*

### About this article

Chris Budd.

Chris Budd OBE is Professor of Applied Mathematics at the University of Bath, Vice President of the Institute of Mathematics and its Applications, Chair of Mathematics for the Royal Institution and an honorary fellow of the British Science Association. He is particularly interested in applying mathematics to the real world and promoting the public understanding of mathematics.

He has co-written the popular mathematics book *Mathematics Galore!*, published by Oxford University Press, with C. Sangwin, and features in the book *50 Visions of Mathematics* ed. Sam Parc.