23 and maths

Last week the company 23andMe generated headlines by launching its personalised DNA testing service in the UK. If you'd like to know your risk of developing a range of diseases, all you need to do is request a testing kit, take a saliva sample, send it off, and await the results.

An illustration of a chromosome.

The headlines were mostly to do with the implications of offering such tests to people directly, especially since 23andMe's health-related services were banned in the US last year. There are all sorts of questions to consider, from the emotional effects on people who might make drastic decisions based on their results, to the reliability of the test and whether we really know enough about various diseases to offer a risk assessment based on DNA. But there's also a mathematical question. How do you calculate someone's risk of developing a disease based on their DNA?

To answer this question, here's a little biology first. The DNA we find in people's cells is wrapped into bundles called chromosomes. We each have 23 pairs of chromosomes (hence the name of the company): the two chromosomes in each pair correspond to the same sequence of genes, but one comes from your mother and one from your father. These sequences are similar, but not identical: your mother and father may have had different versions, called alleles, of a gene. These alleles can be implicated in diseases. For example, the human CFTR gene has a normal version (call it A) and a mutated version (call it a). Depending on what they inherit from their parents, people end up with one of the three pairs AA, Aa, or aa. In the last case, where a person has received two mutated alleles, they will be diagnosed with cystic fibrosis.

To make things simple, let's suppose that a person can have one of three different genotypes linked to a disease, one for each of the allele pairs, as in the example above. Call those genotypes G_{1}, G_{2} and G_{3}.

The quantities that you would like to work out are the probabilities that a person develops the disease given they have genotype G_{i}, for i equal to 1, 2 and 3, for which we write P (D | G_{i}).

This is a conditional probability and it's defined as:

P (D | G_{i}) = \frac{P (G_{i} \cap D)}{P (G_{i})},

where P (G_{i} \cap D) is the probability a person has the disease and genotype G_{i}, and P (G_{i}) is the probability a person has genotype G_{i}.

Now if you are lucky, someone might have conducted a study into the disease which checked the genotype of each of a very large number of people and followed them over years to see if they developed the disease. An estimate for the conditional probability P (D | G_{i}) would then be the proportion of people with genotype G_{i} that developed the disease. That would be straightforward to calculate, but things aren't usually that simple. There may not be such a large study for the disease you're interested in, as following people over years is costly and complicated. It's more likely that somebody has conducted a case control study in which they gathered together a group of people who have the disease and a group who don't, and then looked at their genotypes. Because the numbers of people with and without the disease have been fixed artificially in this case, the calculation above doesn't work any more: the proportion of people with genotype G_{i} that have the disease is no longer a valid estimate for P (D | G_{i}).

But there is help at hand. Using a technique called logistic regression, statisticians can calculate estimates of numbers called odds ratios. Write Odds (D | G_{i}) for the odds that a person develops the disease given they have genotype G_{i}.

The odds ratio

O R_{2} = \frac{Odds (D | G_{2})}{Odds (D | G_{1})}

measures how much bigger (or smaller) the odds of developing the disease are when you have genotype G_{2}, compared to genotype G_{1}. For example, if O R_{2} = 2, then the odds of developing the disease with genotype G_{2} are twice the odds of developing the disease with genotype G_{1}.

Similarly, the odds ratio O R_{3} is defined as

O R_{3} = \frac{Odds (D | G_{3})}{Odds (D | G_{1})} .
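To see why a case control study spoils the simple proportion but still leaves the odds ratio usable, here is a minimal sketch in Python. All the counts are made up for illustration, only two genotypes are used to keep it short, and the odds ratio is computed directly from the table of counts rather than by logistic regression. The "study" keeps every person with the disease and draws a fixed number of healthy people at random, using expected counts instead of actual random sampling.

```python
# Hypothetical population counts (illustrative numbers, not from any real study).
population = {
    "G1": {"disease": 9_000, "healthy": 891_000},
    "G2": {"disease": 2_000, "healthy": 98_000},
}


def risk(counts):
    """Proportion of people in this genotype group who have the disease."""
    return counts["disease"] / (counts["disease"] + counts["healthy"])


def odds(counts):
    """Odds of disease in this genotype group: diseased divided by healthy."""
    return counts["disease"] / counts["healthy"]


def case_control_sample(pop, n_controls):
    """Expected counts when all cases are kept and n_controls healthy
    people are drawn at random from the whole population."""
    total_healthy = sum(c["healthy"] for c in pop.values())
    fraction = n_controls / total_healthy
    return {
        g: {"disease": c["disease"], "healthy": c["healthy"] * fraction}
        for g, c in pop.items()
    }


study = case_control_sample(population, n_controls=11_000)

true_risk_g2 = risk(population["G2"])   # what we would like to know
naive_risk_g2 = risk(study["G2"])       # proportion seen in the study
or_population = odds(population["G2"]) / odds(population["G1"])
or_study = odds(study["G2"]) / odds(study["G1"])

print("P(D | G2) in the population:", true_risk_g2)
print("Proportion in the case control study:", naive_risk_g2)
print("Odds ratio in the population:", or_population)
print("Odds ratio in the case control study:", or_study)
```

In this sketch the proportion of diseased people among those with the second genotype is far larger in the study than in the population, because the study contains as many cases as controls by design. The odds ratio comes out the same in both, since shrinking the healthy groups by a common factor cancels in the ratio.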

When 23andMe calculate the conditional probabilities P (D | G_{i}) they use these odds ratios, as well as estimates of the P (G_{i}) (the probabilities that someone has the genotype G_{i}) and an estimate of the probability P (D) that a random person in the population will develop the disease (not conditioned on genotype or anything else). Now, as gamblers will know, odds are slightly different from probabilities. If the odds in favour of a horse winning a race are 3 to 1, then the probability of it winning is 3/4. But luckily there is an easy formula that relates odds and probabilities,

Odds(Event A happens) = \frac{P (Event A happens)}{1 - P (Event A happens)},

so having an estimate of the odds ratio allows you to express P (D | G_{2}) and P (D | G_{3}) in terms of P (D | G_{1}) (you can rearrange the equations above to do this). All you need now to work out the values of P (D | G_{3}) and P (D | G_{2}) is a value for P (D | G_{1}).
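Spelled out, that rearrangement might look as follows. The shorthand p_{i} for P (D | G_{i}) is introduced here for brevity and is not part of the original notation. Since the odds with genotype G_{2} are O R_{2} times the odds with genotype G_{1}, the formula relating odds and probabilities gives:

```latex
\frac{p_{2}}{1 - p_{2}} = O R_{2} \cdot \frac{p_{1}}{1 - p_{1}}
\quad \Longrightarrow \quad
p_{2} = \frac{O R_{2} \, p_{1}}{1 - p_{1} + O R_{2} \, p_{1}},
\qquad \text{and likewise} \qquad
p_{3} = \frac{O R_{3} \, p_{1}}{1 - p_{1} + O R_{3} \, p_{1}} .
```

As a quick check, setting O R_{2} = 1 gives p_{2} = p_{1}, as it should when the two genotypes carry the same odds.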

This isn't too hard to find. Since

P (D | G_{i}) P (G_{i}) = P (D \cap G_{i})

we have

P (D | G_{1}) P (G_{1}) + P (D | G_{2}) P (G_{2}) + P (D | G_{3}) P (G_{3}) = P (D \cap G_{1}) + P (D \cap G_{2}) + P (D \cap G_{3}) .

This is just the probability that someone has the disease and genotype G_{1}, or that they have the disease and genotype G_{2}, or that they have the disease and genotype G_{3}. Since there are only three genotypes this is just the probability P (D) of someone having the disease:

P (D | G_{1}) P (G_{1}) + P (D | G_{2}) P (G_{2}) + P (D | G_{3}) P (G_{3}) = P (D) .

Who will get sick?

Since estimates of P (D) and the three P (G_{i}) are available, this gives you an equation with one unknown: simply substitute your expressions of P (D | G_{2}) and P (D | G_{3}) in terms of P (D | G_{1}) into the formula. You're left with P (D | G_{1}) as the only unknown, for which you can solve the equation exactly. Done.
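The whole recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the input numbers are hypothetical, and the equation is solved numerically by bisection rather than in closed form. Bisection works here because the left-hand side of the equation grows steadily from 0 to 1 as P (D | G_{1}) goes from 0 to 1, provided the three genotype probabilities add up to 1.

```python
def prob_from_odds_ratio(p1, odds_ratio):
    """P(D | G_i) obtained from P(D | G_1) and the odds ratio OR_i,
    using Odds = P / (1 - P)."""
    return odds_ratio * p1 / (1 - p1 + odds_ratio * p1)


def genotype_risks(odds_ratios, genotype_freqs, p_disease, tol=1e-12):
    """Return [P(D|G_1), P(D|G_2), P(D|G_3)].

    odds_ratios    -- (OR_2, OR_3), both relative to genotype G_1
    genotype_freqs -- (P(G_1), P(G_2), P(G_3)), assumed to sum to 1
    p_disease      -- P(D), the overall risk in the population
    """
    def all_risks(p1):
        return [p1] + [prob_from_odds_ratio(p1, r) for r in odds_ratios]

    def total_risk(p1):
        # Left-hand side of the equation: sum of P(D|G_i) * P(G_i).
        return sum(p * f for p, f in zip(all_risks(p1), genotype_freqs))

    # Bisection on P(D|G_1): total_risk is increasing in p1.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total_risk(mid) < p_disease:
            lo = mid
        else:
            hi = mid
    return all_risks((lo + hi) / 2)


# Hypothetical inputs, chosen only to illustrate the calculation.
example_odds_ratios = (1.5, 2.0)         # OR_2 and OR_3
example_freqs = (0.49, 0.42, 0.09)       # P(G_1), P(G_2), P(G_3)
example_p_disease = 0.1                  # P(D)

risks = genotype_risks(example_odds_ratios, example_freqs, example_p_disease)
for i, p in enumerate(risks, start=1):
    print(f"P(D | G_{i}) is approximately {p:.4f}")
```

The three numbers printed, weighted by the genotype probabilities, add back up to the assumed population risk, which is a useful sanity check on the result.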

This method is indeed the one used by 23andMe, or at least it's the one they present in a paper from 2007. Things do get a little more complicated with diseases which are associated not just with one DNA sequence variation, as we assumed above where there were two possible alleles in one particular place in the DNA, but with several variations in different places. In this case 23andMe use products of odds ratios to come up with a composite risk estimate for an individual's genotype. You can find out more here.

This isn't an exact science of course. A paper published in January this year compared the predictions of 23andMe's services with those of two other companies offering similar tests, deCODEme and Navigenics. The authors found that those predictions could vary considerably. For Crohn's disease, for example, over 27% of the (hypothetical) individuals that were part of the study were classified in opposing risk categories by at least two companies. For age-related macular degeneration that figure was nearly 20% and for prostate cancer 15.5%. These differences were partly due to the companies associating different gene variations with the diseases, but they were partly down to the maths. The companies used different estimates for the population risk for some diseases and slightly different formulae to work out the risk for a person with an individual genotype.

So if you are planning to use 23andMe's services to have your risks assessed, it's probably best to heed the advice of the Medicines and Healthcare Products Regulatory Agency: "Remember that no test is 100% reliable."
