Brief summary
The advent of artificial intelligence poses new threats to the privacy of our personal data. This article explores the challenges and a way to address them.
"Many services on the internet are free, and they are free for a reason: you're the product," says Jorge González Cázares, Assistant Professor at IIMAS - UNAM. "Most companies that have personal information about you sell your data in some way." Companies aren't allowed to sell the information they collect about you without your permission, of course, but there are other ways of making money from it. Big Data means big business.
There are also more altruistic reasons for why an organisation may want to share what they know about their clients. Health records, for example, are invaluable for medical research and public health policy. Information about our travelling patterns is useful for planning. And the information gathered about us in a census belongs to all of us. It's our democratic right to be able to learn from it.
Whether information is being shared for commercial reasons or not, it's important that individuals' data can't be identified from what's being shared. Nobody wants their health records, their financial transactions, or their movements to become public knowledge, potentially to be exploited by criminals. This concern has existed ever since we started collecting large amounts of data about ourselves. And it applies even more so now, at a time when AI can do far more with data than humans, or traditional algorithms, could ever dream of.
From medical data to AI art
The advances we have seen in artificial intelligence recently, including things like ChatGPT and the amazing feats of generative AI, all rely on machine learning algorithms. The power of these algorithms comes from their ability to spot patterns in data. As a simple example, an algorithm might be trained on a large set of pictures of cats and dogs. It will find patterns within these images that are linked to whether they show a cat or a dog, and then use these patterns to predict, very reliably, whether a picture it has never seen before shows a cat or a dog. Exactly what patterns the algorithm has spotted may never become clear to the humans who designed it. (You can find out more about machine learning in this article.)
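To make the pattern-spotting idea concrete, here is a minimal sketch in Python using the scikit-learn library. The "images" are just made-up pairs of numbers invented for this illustration, not real pictures, and real image classifiers are of course far more sophisticated.

```python
# A toy illustration of pattern-spotting: a classifier trained on made-up
# numerical features standing in for "cat" and "dog" pictures.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each picture is summarised by two numbers (say, ear shape and snout length).
cats = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
dogs = rng.normal(loc=[1.5, 1.5], scale=0.5, size=(100, 2))

X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)  # 0 = cat, 1 = dog

# The model "spots the pattern" that separates the two groups.
model = LogisticRegression().fit(X, y)

# Predict for a picture the model has never seen before.
new_picture = np.array([[1.4, 1.6]])
print("dog" if model.predict(new_picture)[0] == 1 else "cat")
```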
Artificial intelligence may soon help radiologists to work through the millions of mammograms performed each year as part of breast cancer screening. Find out more here.
While traditionally organisations may have shared general statistical and mathematical information gleaned from their data sets (see this article), the trend today is towards organisations sharing machine learning algorithms (also called models) trained on their data. An interesting example comes from the medical field, where models are now being developed that can look at the results of a patient's tests and predict the chance of them having a particular disease — from breast cancer to Alzheimer's.
Once these kinds of models are fully operational it would be useful if the organisations that trained them could share them with others. For example, a hospital in Japan might want to share their model, trained on data from Japanese people, with a hospital in Mexico, so that the Mexican hospital has a way of predicting outcomes for Japanese patients on whom it doesn't have much data of its own. The problem here is that the training data comes from real people who'd like their information to remain private. It's important therefore that having possession of the model doesn't mean you can reconstruct any of the data it was trained on.
Another example, where keeping training data inaccessible is important for a different reason, comes from generative AI. Imagine you have developed a model that can generate art — pictures or poems, for example. You will have trained that model on real artworks, but when a user employs it to generate a new piece, you don't want the output to be too similar to a real artwork that was in the training set. Because if it is, you've infringed copyright.
Connecting theory and practice
Our examples show that machine learning models should ideally be designed in such a way that reconstruction of training data is impossible, or at least unlikely. There are techniques that can help you on the way to achieving this. You could, for example, change the training data by random amounts so that it's harder to infer an individual's actual data from a model's output. (For an interesting example of how randomness can protect privacy, see here.) Another approach is to train a model, not on real data, but on a synthetic dataset you have produced to resemble the real data. Both these methods come with a trade-off, though: the less the training data resembles the real data, the harder it is to reconstruct individuals' real information, but the less useful the resulting model will be.
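As a rough sketch of the first idea, adding random noise before training, the Python snippet below perturbs every record with Laplace noise and then fits a simple model on the blurred data. The dataset, the noise scale and the choice of model are all invented for illustration; choosing how much noise to add is exactly the privacy-versus-usefulness trade-off described above.

```python
# Sketch: perturb the training data with random noise before fitting a model,
# so the model never sees anyone's exact values. Everything here is made up
# for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Made-up "real" data: one feature and an outcome for 200 individuals.
X_real = rng.normal(size=(200, 1))
y_real = 3.0 * X_real[:, 0] + rng.normal(scale=0.1, size=200)

# Add Laplace noise to every record.
# Larger noise_scale means more privacy but a less accurate model.
noise_scale = 0.5
X_noisy = X_real + rng.laplace(scale=noise_scale, size=X_real.shape)
y_noisy = y_real + rng.laplace(scale=noise_scale, size=y_real.shape)

# Train only on the perturbed data.
model = LinearRegression().fit(X_noisy, y_noisy)
print("estimated slope:", model.coef_[0])  # close to 3, but blurred by the noise
```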
Such methods are being employed by organisations that develop models in situations where there are privacy concerns. The UK regulator, the Information Commissioner's Office, recommends (but does not require) their use. Overall, though, it's not easy to see just how effective these methods would be, for example when faced with a sophisticated, concerted effort to expose the training data (a privacy attack). It's not even clear, says González Cázares, how worried companies are about data protection. "[To find out] the government would have to take the models and [launch privacy] attacks on them, to see if they can find out individual data."
Differential privacy offers a way of measuring the risk to personal data.
In the light of all this it would be useful to have a rigorous theoretical framework for judging the risk a particular model poses to privacy. Such a framework does exist, stemming from a time before artificial intelligence gained the power it possesses today. Differential privacy provides a mathematical way of measuring to what extent an algorithm designed to glean information from a dataset protects the data of individuals. It does this by measuring to what extent the output of the algorithm depends on an individual's data being part of the underlying dataset or not. This makes intuitive sense: if the output barely changes when your information is added to a dataset, then it's harder for adversaries to leverage the difference in outputs to identify your data. See this article to find out more about differential privacy and the intuition behind it.
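For readers who like to see the idea written down, the standard mathematical definition says that a randomised algorithm $A$ is $\varepsilon$-differentially private if, for any two datasets $D$ and $D'$ that differ in a single individual's record, and for any set $S$ of possible outputs,

```latex
\Pr[\,A(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,A(D') \in S\,].
```

The smaller the privacy parameter $\varepsilon$, the less the algorithm's output can depend on any one person's data, and so the better that person's information is protected.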
Differential privacy can help organisations manage the trade-off between privacy and usefulness of models. But seeing how exactly it fits in with machine learning is still a work in progress. And just as the galloping advances in artificial intelligence have somewhat outpaced regulators, they're also outpacing theory.
"Part of the problem with machine learning is that practice moves so much faster than theory," says González Cázares. "It's easy to run an experiment on a massive supercomputer if you work at Google or Amazon without [stringent] theoretical guarantees. You just run it without checking the maths, and if it works it works, you can go and implement it. Theory is much slower to catch up." What's more, says González Cázares, there is not much communication between theorists and practitioners.
It's for this reason that one of the UK's leading mathematical research institutions, the Isaac Newton Institute for Mathematical Sciences (INI), recently teamed up with the country's national institute for data science and artificial intelligence, the Alan Turing Institute (ATI), to put on a month-long research programme to investigate aspects of the theory behind machine learning. González Cázares co-organised the programme, and was also involved with a special workshop on differential privacy, run by the INI's impact initiative, the Newton Gateway to Mathematics, in conjunction with the ATI.
The workshop featured presentations on applications in the real world, including AI-generated images and machine learning models trained on hospital data, and brought theorists together with practitioners. "[The best outcome was] to get people to work together and exchange ideas, so people from theory understand the [real-world] problems and practitioners gain insights into theoretical tools they were not aware of," says González Cázares.
It's a strange situation we find ourselves in, with AI holding so much promise but also carrying real risks. The joint INI and ATI programme was one of a series of three programmes looking at the mathematical theory behind machine learning. Accelerating this theory so it can catch up with practice is crucial for making the best of new technology while keeping the risks in check.
About this article
Jorge González Cázares at the Isaac Newton Institute.
Jorge González Cázares is an Assistant Professor at IIMAS - UNAM in Mexico City. His research includes work in probability, statistics, mathematical finance and machine learning, specialising in Lévy processes, simulation and limit theorems.
González Cázares co-organised the research programme Heavy tails in machine learning, held at the Alan Turing Institute in April 2024. It was a satellite programme of the Isaac Newton Institute for Mathematical Sciences.
Marianne Freiberger, Editor of Plus, met González Cázares at the Open for Business event Connecting heavy tails and differential privacy in machine learning organised by the Newton Gateway to Mathematics.
This article was produced as part of our collaborations with the Isaac Newton Institute for Mathematical Sciences (INI) and the Newton Gateway to Mathematics.
The INI is an international research centre and our neighbour here on the University of Cambridge's maths campus. The Newton Gateway is the impact initiative of the INI, which engages with users of mathematics. You can find all the content from the collaboration here.