Here's a strange fact: if you look up some numbers, for example the numbers in your tax return, the population sizes of Chinese provinces, or the lengths of the world's rivers, then most likely around 30% of these numbers start with the digit 1, around 18% start with the digit 2, around 12.5% start with a 3, and so on, all the way to 9 (which only heads up around 5% of the numbers) - the larger the digit, the fewer numbers in your list start with it. This fact, known as Benford's law, applies to so many different kinds of data sets that it's often used to detect fraud. But why does it work?
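The percentages quoted above match the usual statement of Benford's law, in which the leading digit d appears with probability log10(1 + 1/d). The article itself doesn't spell out this formula, so treat the following as a small illustrative sketch; the function name benford_probability is made up for this example.

```python
import math


def benford_probability(digit: int) -> float:
    """Share of numbers expected to start with `digit` (1 to 9) under Benford's law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the expected share for every possible leading digit.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

Running this gives about 30.1% for the digit 1, 17.6% for 2, 12.5% for 3, and 4.6% for 9, and the nine shares add up to 100%.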
Well, if the processes that give rise to your list of numbers do produce a universal distribution of first digits, then this distribution should apply no matter what units you use. It should work no matter if you do your tax return in pounds or in euros, or measure your rivers in metres or miles - it's universal after all. This means that the distribution of first digits remains the same when you multiply your numbers by whatever constant you need to change between units. And it turns out that the only distribution with this property of "scale invariance" is precisely the Benford distribution.
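To see scale invariance in action, here is a rough numerical check. It assumes a made-up data set whose logarithms are spread evenly between 1 and 10, which is one simple way of producing Benford-distributed first digits; the helper names and the conversion factors (doubling, kilometres to miles at 0.621371, and an arbitrary 0.85 standing in for an exchange rate) are all illustrative choices rather than anything taken from the article.

```python
import math


def first_digit(x: float) -> int:
    """Leading non-zero digit of a non-zero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def digit_shares(values):
    """Fraction of values starting with each digit 1..9, as a list of nine numbers."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [counts[d] / len(values) for d in range(1, 10)]


# Assumed data set: numbers between 1 and 10 whose logarithms are evenly spaced.
N = 100_000
benford_like = [10 ** (i / N) for i in range(N)]

# The shares predicted by Benford's law, for comparison.
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

for factor in (1, 2, 0.621371, 0.85):
    shares = digit_shares([v * factor for v in benford_like])
    worst = max(abs(s - e) for s, e in zip(shares, expected))
    print(f"factor {factor}: largest deviation from Benford = {worst:.5f}")
```

Whatever factor you pick, the first-digit shares should stay within a tiny rounding error of the Benford values, which is what "the same in any units" means in practice.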
As an example, imagine that your first digits are distributed equally (roughly the same proportion of numbers begins with each of the digits 1, 2, 3, ...) - so NOT according to the Benford distribution. Is this distribution scale invariant? Let's see what happens when we multiply by 2. All numbers starting with 5, 6, 7, 8 or 9, when multiplied by 2, give a number starting with 1. By contrast, the only way to end up with a number beginning with, say, 3, is to start out with a number starting with 1 (and even then only those whose first two digits lie between 15 and 19 will do). In other words, the resulting distribution of first digits, after multiplying by 2, is skewed towards 1. It's not uniform, so your original distribution is not scale invariant. It's not too hard to show that in order to be scale invariant, the first digits have to be distributed in the way stipulated by the Benford distribution. It's worth noting though that Benford's law only applies to data sets that are neither too random, nor too constrained: alas, it doesn't work for lottery numbers.
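The doubling example can be checked with a few lines of code. This sketch assumes the simplest data set with equally distributed first digits, namely numbers spread evenly between 1 and 10; that choice, and the helper names, are assumptions made for illustration.

```python
def first_digit(x: float) -> int:
    """Leading non-zero digit of a non-zero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def digit_shares(values):
    """Fraction of values starting with each digit 1..9, as a list of nine numbers."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [counts[d] / len(values) for d in range(1, 10)]


# Assumed data set: evenly spaced numbers from 1 up to (but not including) 10,
# so each first digit heads up one ninth of the list.
N = 90_000
uniform = [1 + 9 * i / N for i in range(N)]

before = digit_shares(uniform)
after = digit_shares([2 * v for v in uniform])

for d in range(1, 10):
    print(f"{d}: {before[d - 1]:.1%} before doubling, {after[d - 1]:.1%} after")
```

Under this assumption, every digit starts out with a share of about 11%, but after doubling the digit 1 jumps to roughly 56% (five ninths) while each of the other digits drops to about 5.6%.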