## An Adaptation of Benford’s “Law”

It might be expected, prima facie, that roughly the same number of surnames in a sample would begin with each letter of the Roman alphabet and that the proportions of surnames categorised by their initial letters would be approximately uniform and equal to 1/26.

However, for many kinds of alphabetic data, the distribution of initials is skewed. A mathematical relationship (known as Benford's law for numeric data) seems to hold when adapted to model alphabetic data.

See http://plus.maths.org/issue9/features/benford/ regarding numeric data.

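Using logarithms to base 27, the expected proportion (P) of surnames beginning with a given letter is P = log[(n+1)/n], where n is the alphabetic rank of the letter (A = 1, B = 2, ..., Z = 26, so 0 < n < 27). The cumulative proportion for the first n letters is Sum(P) = log(n+1), which reaches log(27) = 1 at n = 26, so the 26 proportions add up to the whole sample.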
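The cumulative form follows because the sum of the individual terms telescopes. A short derivation, written in LaTeX with the base made explicit (the index k is introduced here only for the summation):

```latex
\sum_{k=1}^{n} \log_{27}\!\frac{k+1}{k}
  = \log_{27}\!\prod_{k=1}^{n}\frac{k+1}{k}
  = \log_{27}\!\left(\frac{2}{1}\cdot\frac{3}{2}\cdots\frac{n+1}{n}\right)
  = \log_{27}(n+1),
\qquad
\log_{27}(26+1) = 1 .
```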

Under this model, about 33% of the surnames in a sample are expected to begin with either A or B (the base-27 log of 3 is 1/3), and about 67% are expected to begin with one of the eight letters from A to H (the base-27 log of 9 is 2/3).
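The figures above can be checked numerically. The following is a minimal Python sketch, not the author's own code: the function names `expected_proportion`, `cumulative_proportion` and `observed_proportions` are hypothetical, and the helper for observed data assumes surnames are plain strings whose first character is an unaccented Roman letter.

```python
import math
import string
from collections import Counter

BASE = 27  # logarithm base used by the model (26 letters + 1)


def expected_proportion(rank: int) -> float:
    """Model proportion P = log_27((n+1)/n) for alphabetic rank n (A=1 ... Z=26)."""
    if not 1 <= rank <= 26:
        raise ValueError("rank must be between 1 and 26")
    return math.log((rank + 1) / rank, BASE)


def cumulative_proportion(rank: int) -> float:
    """Model proportion of surnames whose initial has rank <= n: log_27(n+1)."""
    if not 1 <= rank <= 26:
        raise ValueError("rank must be between 1 and 26")
    return math.log(rank + 1, BASE)


def observed_proportions(surnames):
    """Share of each initial A-Z in a sample (hypothetical helper for comparison).

    Entries that are empty or do not start with an unaccented Roman
    letter are skipped, which is an assumption of this sketch.
    """
    initials = (s.strip()[0].upper() for s in surnames if s.strip())
    counts = Counter(c for c in initials if c in string.ascii_uppercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: counts[letter] / total for letter in string.ascii_uppercase}


if __name__ == "__main__":
    print(round(cumulative_proportion(2), 2))  # A or B  -> 0.33
    print(round(cumulative_proportion(8), 2))  # A to H  -> 0.67
```

Comparing the output of `observed_proportions` for a real list of surnames against `expected_proportion(1)` to `expected_proportion(26)` is one simple way to see how closely a given sample follows the model.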

A generalised version of this law would not work for truly random sets of data. It would work best for data that are neither completely random nor overly constrained, but rather lie somewhere in between. These data could be wide ranging and would typically result from several processes with many influences.

Michael Mernagh, Cork, Ireland. February 17, 2011.