Differential privacy: Keeping your data safe

Brief summary

When private data about people is made available to third parties there's always a chance that individuals' data can be identified even if the data is anonymised. Differential privacy is a way of measuring the extent to which releasing information about a database poses a risk to privacy.

How do you feel about your health record, with your name removed, being made available for medical research? It's sensitive data, but the nation's collective medical records hold invaluable information for science and public health. And since your name's been removed, surely you can't be identified?

Not quite. For an example take William Weld, governor of Massachusetts from 1991 To 1997. When the state made health records of its employees available to researchers in 1997 Weld had reassured the population that no one would be identifiable from the anonymised data. A few days later Weld received an envelope containing his very own health record. It had been sent by Latanya Sweeney, then a graduate student at the Massachusetts Institute of Technology, who'd cross-referenced pieces of information still left in the database — sex, date of birth, and ZIP code — with the voter registration records. This was enough to identify Weld.

Clearly, then, if sensitive data is to be made available to third parties, anonymising it isn't enough. This doesn't just go for health records released for public good. Many organisations hold information about us that can fetch a lot of money. If they are going to sell it on, then unless we have granted permission, they should do so in a way that protects our privacy.

Under attack

One way of protecting private information stored in a database is to never give anyone access to the actual database, but only give them summary statistics. For example, you might only allow third parties to see sums or averages that have been calculated over many entries in the database, so individual records are subsumed by the bulk. Organisations holding the data could publish such statistics themselves, or allow third parties to query the databases for these numbers, for example through interactive computer systems.

It turns out that randomness is a crucial ingredient in data protection

This seems innocuous enough, until you realise that a clever combination of statistics can tease out the very information an organisation is trying to protect. As an example, imagine the database lists the incomes of all the people in a city and that third parties have access to averages. All an "attacker" needs to do to reveal the individual entry in row X is to look up the average calculated over all the entries and then look up the average calculated over all entries except X. These two averages — with and without X — are enough information to work out the exact value of X.

Repeating this for all rows allows the attacker to work out all entries in the list. Even if the list is anonymised, cross-referencing with other information, Latanya Sweeney style, may reveal some of the names behind the rows. The attacker can then publish individuals' incomes online, sell the information on, or extort money in exchange for keeping their mouth shut.

This simple example is easy to remedy — disallow queries for the types of average in question — but the problem runs deeper. To allow third parties to gain meaningful information from a database at all, you do need to give them access to a lot more than just one or two summary statistics. After a census, for example, billions of statistics are usually made public. Guaranteeing safety by checking whether one of the many combinations of statistics can identify individuals is an impossible task (and indeed the Office of National Statistics applies various stringent measures to ensure data protection.)

Randomness: Your private friend

Randomised response

Since a coin has a fifty-fifty chance of coming up heads, you know that around 50 of the respondents will have answered "yes" to the drugs question because the coin came up heads for them. If the total number of "yes" answers is, say, 60 you know that around 10 of those come from people who responded truthfully. Those 10 truthful answers come from the 50 or so people who got tails when they tossed the coin. This means a proportion of 10/50, or 20%, of the tails group of people use drugs.

Since your sample of people is quite large, and the tail group of people were selected randomly by coin flip (rather than based on a particular characteristic) you can assume the group is characteristic of the entire sample of 100 people. And if the entire sample was chosen to be representative of the population as a whole, then you now have an estimate for how many people regularly take drugs: it's around 20%. You've been able to make this estimate without compromising anyone's privacy.

Another approach to protecting privacy is to add a little randomness. As a toy example, imagine you're conducting a lifestyle survey among 100 people, which includes the question of whether they regularly use illegal drugs. Those who do are unlikely to respond truthfully. What you can do instead is ask them to find their purse and fish out a coin. When it comes to answering the drugs question, tell them to go to a private place and flip the coin. If it comes up heads, tell them to answer "yes" to the drugs question, regardless of whether or not they use drugs. If it comes up tails, tell them to answer truthfully.

This protects every individual from prosecution, even if the police were to get hold of their answers. If they answered "yes" they can simply say they tossed heads — no one can prove otherwise. At the same time, because you understand the nature of the randomness (it comes from a coin flip) you can still estimate how many of the 100 people regularly take drugs. See the box to find out how.

A fundamental law of privacy-loss

The technique we just described is called randomised response, and was developed in the 1960s, long before the world was flooded with the volume of data we're seeing now. There are more sophisticated versions of it, all based on the idea that randomness can protect individuals, while still making it possible to infer important statistical estimates.

The idea can be generalised to our set-up from above, where individual entries from a database are never released, but certain statistics are published: instead of publishing precise numbers, add a little randomness to them. There's no need for a coin to do this. You can instead devise an algorithm that chooses the amount of randomness to be added from a suitable probability distribution.

Of course there is a trade-off here. Adding too much randomness renders the blurred statistics meaningless. Adding too little may mean that privacy isn't sufficiently protected. This echoes the trade-off we already spotted above. Publishing too few statistics will make the dataset useless to those wanting to learn from it. Publishing too many may mean that privacy is compromised.

Indeed these trade-offs can lead to death by a thousand cuts. A devastating theoretical result published in 2003 by Kobbi Nissim and Irit Dinur showed that, unless a lot of randomness is added, a clever adversary with access to sufficiently many statistics can, in theory at least, reconstruct a database almost exactly. While a single statistic reveals almost no information about an individual — it's a tiny cut — many statistics taken together can completely destroy an individual's privacy, even when the numbers have been changed with a little bit of randomness.

The result has become known as the fundamental law of information recovery. It's as inescapable as a law of nature.

Differential privacy

In the light of the fundamental law it would be good to at least have a way of gauging how much risk there is to individuals' data when information gained from a dataset is published. Differential privacy, a concept developed in response to Nissim and Dinur's result, provides just that.

To understand the idea, imagine you have a database (or many of them) and are trying to decide what kind of statistics to give people access to. Statistics (even those with randomness added in) are produced by calculations, by algorithms, so your question becomes what kind of algorithms to allow. The general idea behind differential privacy is that the risk to privacy can be measured by the extent to which the algorithms' outputs depend on individual entries in the database — if an algorithm outputs very nearly the same result whether you calculate it from the entire database or the database with a single entry removed, then it is deemed relatively low risk to privacy.

Differential privacy could help organisations and regulators.

If this is the case, then people can feel safe knowing their information is included in a database — because their information makes very little difference to any statistics released, it's almost as safe as if it wasn't included in the database at all. The idea echoes our example above, involving the list of incomes: it was the difference between averages calculated with and without a particular entry that allowed identification of an individual income.

There are several definitions of differential privacy, which we won't include here as they're quite technical. You can see the definition of what's known as pure differential privacy, developed by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith in 2006, below. It includes a parameter $\epsilon$ which measures the dependency of outputs of an algorithm on individual entries and therefore the level of protection to individual entries privacy. Randomness is a crucial ingredient here. Adding randomness to the outputs of an algorithm will increase the level of protection, but also make the information it provides less useful.

The measure provided by differential privacy can help organisations that hold sensitive data negotiate the trade-off between usefulness of statistics and privacy. They can set a "privacy-loss budget" and then make sure that all published information stays within the budget. Very sensitive data would come with a low budget, meaning that the information released about it will be less useful. Less sensitive data can come with a higher budget, providing better accuracy.

Since the inception of differential privacy in the 2000s it's seen several real-world applications. But since then the world has also moved on massively as far as data are concerned. In this article we introduced the concept of differential privacy in the context of traditional algorithms designed to compute (randomised) statistics. Latest real-world applications, however, have to deal with a threat of a different order of magnitude: artificial intelligence spotting patterns in data that classical algorithms had no chance of ever working out. This throws a whole new perspective on the uses of differential privacy. We will find out more in this article.

A definition of differential privacy

Here's the definition of $\epsilon$-differential privacy. Suppose you have a randomised algorithm $A$, which takes datasets as its inputs. The set of its outputs is called the image of $A$. Let $\epsilon$ be a positive real number. Then the algorithm $A$ is said to provide $\epsilon$ -differential privacy if, for all datasets $D_1$ and $D_2$ that differ on a single element, and all subsets $S$ of the image of $A$ we have $$\frac{Pr(A(D_1)) \in S}{Pr(A(D_2))} \leq e^\epsilon.$$ Here $Pr$ denotes probability defined in terms of the randomness used by $A$. You can see that if $\epsilon=0$ the ratio of the to probabilities is $1$, which means they are equal: there is no difference between the probabilities of outputs lying in $S$, so there is no dependence on individual elements in the database.

To find out more about the properties of $\epsilon$ -differential privacy see Wikipedia.

About the author

Marianne Freiberger is Editor of Plus.

This article was produced as part of our collaborations with the Isaac Newton Institute for Mathematical Sciences (INI) and the Newton Gateway to Mathematics.

The INI is an international research centre and our neighbour here on the University of Cambridge's maths campus. The Newton Gateway is the impact initiative of the INI, which engages with users of mathematics. You can find all the content from the collaboration here.

Plus.Maths.org

Add new comment