## 23 and maths

Submitted by Marianne on December 9, 2014

Last week the company 23andMe generated headlines by launching its personalised DNA testing service in the UK. If you'd like to know your risk of developing a range of diseases, all you need to do is request a testing kit, take a saliva sample, send it off, and await the results.

*Image: an illustration of a chromosome.*

The headlines were mostly to do with the implications of offering
such tests to people directly, especially since 23andMe's
health-related services were banned
in the US last year. There are all sorts of questions to
consider, from the emotional effects on people who might make drastic
decisions based on their results, to the reliability of
the test and whether we really
know enough about various diseases to offer a risk assessment based on
DNA. But there's also a mathematical question. How *do* you
calculate someone's risk of developing a disease based on their
DNA?

To answer this question, here's a little biology first. The DNA we
find in people's cells is wrapped into bundles called
chromosomes. We each have 23 (hence the name of the company) pairs of chromosomes: the two chromosomes
in each pair correspond to the same sequence of genes, but one comes
from your mother and one from your father. These sequences are
similar, but not identical: your mother and father may have had
different versions, called *alleles*, of a gene. These alleles
can be implicated in diseases. For example, the human CFTR gene has a
normal version (call it *A*) and a mutated version (call it
*a*). Depending on what they inherit from their parents, people end
up with one of the three pairs *AA*, *Aa*, or
*aa*. If it's the last of these, meaning a person has received two mutated alleles, they will be
diagnosed with cystic fibrosis.

To make things simple, let’s suppose that a person can have one of three different genotypes linked to a disease, one for each of the allele pairs, as in the example above. Call those genotypes $G_1$, $G_2$ and $G_3$. The quantities that you would like to work out are the probabilities that a person develops the disease $D$ *given* they have genotype $G_i$, for $i$ equal to 1, 2 and 3, for which we write

$$P(D|G_i).$$

This is a conditional probability and it’s defined as

$$P(D|G_i) = \frac{P(D \cap G_i)}{P(G_i)},$$

where $P(D \cap G_i)$ is the probability a person has the disease *and* genotype $G_i$, and $P(G_i)$ is the probability a person has genotype $G_i$.

Now if you are lucky someone might have conducted a study into the disease, which checked the genotype of each of a very large number of people and followed them over years to see if they developed the disease. An estimate for the conditional probability $P(D|G_i)$ would then be the proportion of people with genotype $G_i$ who developed the disease.
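As a minimal sketch of that estimate, the snippet below computes the diseased proportion within each genotype group. The counts and the names `cohort` and `cohort_risk_estimates` are invented purely for illustration; they do not come from any real study.

```python
# Hypothetical cohort counts, invented purely for illustration.
cohort = {
    "G1": {"total": 6000, "diseased": 300},
    "G2": {"total": 3000, "diseased": 300},
    "G3": {"total": 1000, "diseased": 200},
}


def cohort_risk_estimates(counts):
    """Estimate P(D|G_i) as the diseased fraction within each genotype group."""
    return {g: c["diseased"] / c["total"] for g, c in counts.items()}


estimates = cohort_risk_estimates(cohort)
for genotype, risk in estimates.items():
    print(f"P(D|{genotype}) is estimated as {risk:.3f}")
```

This only works because everyone in the study was recruited before anyone knew who would fall ill, so the proportions reflect what happens in the population.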

That would be straightforward to calculate, but things aren’t usually that simple. There may not be such a large study for the disease you’re interested in, as following people over years is costly and complicated. It’s more likely that somebody has conducted a *case control study* in which they gathered together a group of people who have the disease and a group who don’t, and then looked at their genotypes. Because the numbers of people with and without the disease have been fixed artificially in this case, the calculation above doesn’t work anymore: the proportion of people with genotype $G_i$ that have the disease is no longer a valid estimate for $P(D|G_i)$.
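To see how badly the naive proportion can go wrong, here is a small sketch that works out the expected genotype counts in a case control study with equal numbers of cases and controls. All the population values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical population values, invented for illustration.
p_genotype = {"G1": 0.6, "G2": 0.3, "G3": 0.1}           # P(G_i)
p_disease_given = {"G1": 0.05, "G2": 0.10, "G3": 0.20}   # true P(D|G_i)

# Overall disease probability P(D) in the population.
p_disease = sum(p_disease_given[g] * p_genotype[g] for g in p_genotype)


def expected_case_control_counts(n_cases, n_controls):
    """Expected genotype counts when cases and controls are recruited separately."""
    cases, controls = {}, {}
    for g in p_genotype:
        # P(G_i | D) and P(G_i | not D), by Bayes' theorem.
        p_g_given_d = p_disease_given[g] * p_genotype[g] / p_disease
        p_g_given_not_d = (1 - p_disease_given[g]) * p_genotype[g] / (1 - p_disease)
        cases[g] = n_cases * p_g_given_d
        controls[g] = n_controls * p_g_given_not_d
    return cases, controls


cases, controls = expected_case_control_counts(1000, 1000)

# The naive estimate: diseased fraction among study participants with each genotype.
naive = {g: cases[g] / (cases[g] + controls[g]) for g in p_genotype}
for g in p_genotype:
    print(f"{g}: true risk {p_disease_given[g]:.2f}, naive proportion {naive[g]:.2f}")
```

Because half of the participants were chosen for having the disease, the naive proportions come out far higher than the true risks for every genotype.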

But there is help at hand. Using a technique called *logistic regression* statisticians can calculate estimates of numbers called *odds ratios*. Write

$$O(D|G_i) = \frac{P(D|G_i)}{1-P(D|G_i)}$$

for the *odds* that a person develops the disease given they have genotype $G_i$. The odds ratio

$$OR_{21} = \frac{O(D|G_2)}{O(D|G_1)}$$

measures how much bigger (or smaller) the odds of developing the disease are when you have genotype $G_2$, compared to genotype $G_1$. For example, if

$$OR_{21} = 2$$

then the odds of developing the disease with genotype $G_2$ are twice the odds of developing the disease with genotype $G_1$. Similarly, the odds ratio $OR_{31}$ is defined as

$$OR_{31} = \frac{O(D|G_3)}{O(D|G_1)}.$$
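The definitions of odds and odds ratios can be sketched in a few lines. The function names and the three risk values below are hypothetical, picked only to illustrate the arithmetic.

```python
def odds(p):
    """Convert a probability p (with 0 <= p < 1) into odds p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p_a, p_b):
    """Odds at probability p_a divided by the odds at probability p_b."""
    return odds(p_a) / odds(p_b)


# Hypothetical conditional risks for genotypes 1, 2 and 3.
p1, p2, p3 = 0.05, 0.10, 0.20

or_21 = odds_ratio(p2, p1)
or_31 = odds_ratio(p3, p1)
print(f"OR_21 = {or_21:.3f}, OR_31 = {or_31:.3f}")
```

Note that an odds ratio is not the same as a ratio of probabilities: in this example the risk for genotype 2 is exactly twice that for genotype 1, while the odds ratio is a little above 2.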

When 23andMe calculate the conditional probabilities $P(D|G_i)$ they use these odds ratios, as well as estimates of the $P(G_i)$ (the probabilities that someone has the genotype $G_i$) and an estimate of the probability $P(D)$ that a random person in the population will develop the disease (not conditioned on genotype or anything else).

Now, as gamblers will know, odds are slightly different from probabilities. If the odds in favour of a horse winning a race are 3 to 1, then the probability of it winning is 3/4. But luckily there is an easy formula that relates odds and probabilities,

$$P(D|G_i) = \frac{O(D|G_i)}{1+O(D|G_i)},$$

so having estimates of the odds ratios allows you to express $P(D|G_2)$ and $P(D|G_3)$ in terms of $P(D|G_1)$ (you can rearrange the equations above to do this). All you need now to work out the values of $P(D|G_2)$ and $P(D|G_3)$ is a value for $P(D|G_1)$.
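The rearrangement mentioned above can be written out as follows. This is a sketch of the algebra, not a quotation from 23andMe's paper, and the shorthand $p_i$ for the probability of developing the disease given genotype $i$ is introduced here only for this step. Writing $OR_{21}$ for the odds ratio comparing genotype 2 with genotype 1, its definition says

$$\frac{p_2}{1-p_2} = OR_{21}\,\frac{p_1}{1-p_1}.$$

Solving for $p_2$ gives

$$p_2 = \frac{OR_{21}\,p_1}{1 - p_1 + OR_{21}\,p_1},$$

and in exactly the same way

$$p_3 = \frac{OR_{31}\,p_1}{1 - p_1 + OR_{31}\,p_1}.$$

As a quick check, if $p_1 = 0.1$ and $OR_{21} = 2$ then $p_2 = 0.2/1.1 \approx 0.18$, which is a little less than double $p_1$.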

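This isn’t too hard to find. Since

$$P(D|G_i)P(G_i) = P(D \cap G_i)$$

we have

$$P(D|G_1)P(G_1) + P(D|G_2)P(G_2) + P(D|G_3)P(G_3) = P(D \cap G_1) + P(D \cap G_2) + P(D \cap G_3).$$

The right-hand side is just the probability that someone has the disease and genotype $G_1$, or that they have the disease and genotype $G_2$, or that they have the disease and genotype $G_3$. Since there are only three genotypes this is just the probability of someone having the disease:

$$P(D) = P(D|G_1)P(G_1) + P(D|G_2)P(G_2) + P(D|G_3)P(G_3).$$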

*Image caption: Who will get sick?*

Since estimates of $P(D)$ and the three $P(G_i)$ are available, this gives you an equation with one unknown: simply substitute your expressions of $P(D|G_2)$ and $P(D|G_3)$ in terms of $P(D|G_1)$ into the formula. You’re left with $P(D|G_1)$ as the only unknown, for which you can solve the equation exactly. Done.
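Here is a sketch of that final step in code. Rather than solving the equation algebraically, it uses simple bisection, which works because the overall disease probability increases as the baseline risk increases. The function names and all input values are hypothetical, chosen so that the answer is easy to check by hand.

```python
def odds_to_prob(o):
    """Convert odds o into a probability o / (1 + o)."""
    return o / (1 + o)


def risk_given_baseline(p1, odds_ratio):
    """Risk for a genotype, given the baseline risk p1 and its odds ratio against the baseline."""
    return odds_to_prob(odds_ratio * p1 / (1 - p1))


def total_risk(p1, p_genotype, odds_ratios):
    """Overall disease probability implied by p1 (law of total probability)."""
    return sum(pg * risk_given_baseline(p1, r) for pg, r in zip(p_genotype, odds_ratios))


def solve_baseline_risk(p_disease, p_genotype, odds_ratios, tol=1e-12):
    """Find the baseline risk p1 that reproduces p_disease, by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total_risk(mid, p_genotype, odds_ratios) < p_disease:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical inputs: genotype frequencies, odds ratios against genotype 1
# (so the first entry is 1), and the overall disease probability.
p_genotype = [0.6, 0.3, 0.1]
odds_ratios = [1.0, 19 / 9, 4.75]
p_disease = 0.08

p1 = solve_baseline_risk(p_disease, p_genotype, odds_ratios)
risks = [risk_given_baseline(p1, r) for r in odds_ratios]
print([round(r, 4) for r in risks])
```

With these made-up inputs the three genotype risks come out at roughly 5%, 10% and 20%, and weighting them by the genotype frequencies gives back the overall 8%.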

This method is indeed the one used by 23andMe, or at least it's the one they present in a paper from 2007. Things do get a little more complicated with diseases that are associated not just with one DNA sequence variation, as we assumed above where there were two possible alleles in one particular place in the DNA, but with several variations in different places. In this case 23andMe use products of odds ratios to come up with a composite risk estimate for an individual's genotype. You can find out more here.
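One simple way to picture a product of odds ratios is sketched below. It rests on two assumptions that are ours and not necessarily 23andMe's: that the variants act independently, so their odds ratios multiply, and that the baseline risk belongs to someone with the reference genotype at every variant. The function name and the numbers are hypothetical.

```python
import math


def composite_risk(baseline_risk, odds_ratios):
    """Risk after scaling the baseline odds by the product of per-variant odds ratios.

    Assumes the variants act independently on the odds scale (an assumption
    made for this sketch, not a statement of any company's exact formula).
    """
    baseline_odds = baseline_risk / (1 - baseline_risk)
    combined_odds = baseline_odds * math.prod(odds_ratios)
    return combined_odds / (1 + combined_odds)


# Hypothetical example: 5% baseline risk and three variants, each with an
# odds ratio relative to the reference genotype at that place in the DNA.
risk = composite_risk(0.05, [1.5, 0.8, 2.0])
print(f"Composite risk estimate: {risk:.3f}")
```

Variants with odds ratios below 1 pull the estimate down and those above 1 push it up, so the composite can end up on either side of the baseline.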

This isn't an exact science of course. A paper published in January this year compared the predictions of 23andMe's services with those of two other companies offering similar tests, deCODEme and Navigenics. They found that those predictions could vary considerably. For Crohn's disease, for example, over 27% of the (hypothetical) individuals that were part of the study were classified in opposing risk categories by at least two companies. For age-related macular degeneration that figure was nearly 20% and for prostate cancer 15.5%. These differences were partly due to the companies associating different gene variations with the diseases, but they were partly down to the maths. The companies used different estimates for the population risk for some diseases and slightly different formulae to work out the risk for a person with an individual genotype.

So if you are planning to use 23andMe's services to have your risks assessed, it's probably best to heed the advice of the Medicines and Healthcare Products Regulatory Agency: "Remember that no test is 100% reliable."