Synthetic data

What is synthetic data and what does it do?

Synthetic data is ‘artificial’ data generated by data synthesis algorithms. It replicates the patterns and statistical properties of real data (which may be personal information). It is generated from real data using a model trained to reproduce its characteristics and structure. This means that your analysis of the synthetic data should produce very similar results to analysis carried out on the original real data.
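To make this concrete, here is a deliberately simplified toy sketch of the idea (an illustration, not a method this guidance prescribes): fit a very simple "model" of a real dataset, in this case per-column summary statistics, and sample synthetic rows from it. All data and column names are invented. Production generators (eg GANs, copulas or Bayesian networks) also learn relationships between variables, which this example ignores.

```python
# Toy sketch: model each column independently and sample new rows.
# Real generators also capture correlations between columns; this
# example reproduces marginal distributions only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented stand-in for a real dataset containing personal information.
real = pd.DataFrame({
    "age": rng.normal(45, 12, size=1000).round().clip(18, 90),
    "department": rng.choice(["cardiology", "oncology", "A&E"],
                             size=1000, p=[0.3, 0.2, 0.5]),
})

def synthesise(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Sample n synthetic rows reproducing each column's distribution."""
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fit a normal distribution to its mean/std.
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n)
        else:
            # Categorical column: resample using observed frequencies.
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n,
                                  p=freqs.to_numpy())
    return pd.DataFrame(out)

# Analysis of the synthetic data should resemble analysis of the real data.
synthetic = synthesise(real, n=5000)
print(synthetic["department"].value_counts(normalize=True))
```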

It can be a useful tool for training AI models in environments where you are not able to access large datasets.

There are two main types of synthetic data:

  • “partially” synthetic data - this synthesises only some variables of the original data. For example, replacing location and admission time with synthetic values in an A&E admissions dataset, while keeping the real causes of admission (see the sketch after this list); and
  • “fully” synthetic data - this synthesises all variables.
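The following hypothetical sketch illustrates the “partially” synthetic case using the A&E example above: only the location and admission hour are replaced with synthetic values, while the real cause of admission is retained. The dataset and column names are invented for illustration.

```python
# A hypothetical "partially" synthetic dataset: the identifying
# variables are synthesised, the analytically useful one stays real.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

admissions = pd.DataFrame({
    "location":       ["Leeds", "York", "Hull", "Leeds"],
    "admission_hour": [3, 14, 22, 9],
    "cause":          ["fall", "chest pain", "fracture", "overdose"],
})

partially_synthetic = admissions.copy()
# Synthesise the identifying variables from their observed distributions...
loc_freqs = admissions["location"].value_counts(normalize=True)
partially_synthetic["location"] = rng.choice(loc_freqs.index.to_numpy(),
                                             size=len(admissions),
                                             p=loc_freqs.to_numpy())
partially_synthetic["admission_hour"] = rng.integers(0, 24,
                                                     size=len(admissions))
# ...while "cause", the real variable of interest, is kept unchanged.
print(partially_synthetic)
```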

How does synthetic data assist with data protection compliance?

Synthetic data requires real data to generate it, which may involve processing personal information. You can use data synthesis to generate large datasets from small datasets. In cases where you can create synthetic data instead of collecting more personal information, this can help you comply with the data minimisation principle as it reduces your processing of personal information.

You should consider synthetic data for generating non-personal data in situations where you do not need to, or cannot, share personal information.

What are the risks associated with using synthetic data?

You should assess whether the synthetic data you use is an accurate proxy for the original data. The more closely the synthetic data mimics the real data, the greater its utility, but the more likely it is to reveal someone’s personal information.

If you are generating synthetic data from personal information, any inherent biases in that information will be carried through. For example, if the underlying dataset is not representative of your population of interest (eg your customers), the synthetic data you generate from it will not be representative either. You should:

  • ensure that you can detect and correct bias in the generation of synthetic data, and ensure that the synthetic data is representative; and
  • consider whether you are using synthetic data to make decisions that have consequences for people (eg legal or health consequences). If so, you must assess and mitigate any bias in the information.

To mitigate these risks, you should use diverse and representative training data when generating synthetic data, and continuously monitor and address any biases that may arise.
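One simple way to monitor representativeness (an illustrative approach, not a prescribed test) is to compare each group’s share of the synthetic data against its share of the real data, and flag any drift above a tolerance you choose:

```python
# Illustrative representativeness check: compare category shares in
# the real and synthetic data and flag any group whose share drifts
# beyond a chosen tolerance.
import pandas as pd

def check_representation(real: pd.Series, synthetic: pd.Series,
                         tolerance: float = 0.05) -> pd.DataFrame:
    """Report per-category drift between real and synthetic shares."""
    real_p = real.value_counts(normalize=True)
    synth_p = synthetic.value_counts(normalize=True)
    report = pd.DataFrame({"real": real_p, "synthetic": synth_p}).fillna(0.0)
    report["drift"] = (report["synthetic"] - report["real"]).abs()
    report["flagged"] = report["drift"] > tolerance
    return report.sort_values("drift", ascending=False)

# Usage, reusing the earlier toy data:
# print(check_representation(real["department"], synthetic["department"]))
```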

Is synthetic data anonymous?

This depends on whether the personal information on which you model the synthetic data can be inferred from the synthetic data itself. Assessing re-identification risks involved with synthetic data is an ongoing area of development. You should focus on the extent to which people are identified or identifiable in the synthetic data, and what information about them would be revealed if identification is successful.
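As a rough illustration of one such signal (a partial check, not a complete identifiability assessment), you could measure how close each synthetic record sits to its nearest real record; exact or near-exact matches suggest the generator has memorised a real person’s data:

```python
# A rough signal only: the distance from each synthetic record to its
# nearest real record. Near-zero distances warrant investigation
# before any release.
import numpy as np

def nearest_real_distance(real: np.ndarray,
                          synthetic: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to the closest real row.

    Assumes numeric features. Brute force is fine for small data; use a
    KD-tree (eg scipy.spatial.cKDTree) at scale.
    """
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
```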

Some synthetic data generation methods have been shown to be vulnerable to model inversion attacks, membership inference attacks and attribute disclosure. These can increase the risk of inferring a person’s identity. You could protect records containing outliers from these attacks, and from linkage with other information, through:

  • suppression of outliers (data points with some uniquely identifying features; see the sketch after this list); or
  • differential privacy with synthetic data.
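A minimal sketch of the suppression option, assuming a k-anonymity-style rule (one possible approach among several): drop source records whose combination of quasi-identifiers appears fewer than k times, so uniquely identifying rows never reach the synthesis step. Column names are hypothetical.

```python
# One possible suppression rule: keep only source records whose
# quasi-identifier combination occurs at least k times.
import pandas as pd

def suppress_outliers(df: pd.DataFrame, quasi_identifiers: list[str],
                      k: int = 5) -> pd.DataFrame:
    """Drop rows whose quasi-identifier combination appears < k times."""
    sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[sizes >= k]

# Usage (reusing the hypothetical A&E data from earlier):
# safe = suppress_outliers(admissions, ["location", "admission_hour"], k=5)
```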

However, these measures may reduce the utility of the information and introduce a degree of unpredictability into its characteristics.

You should consider the purposes and context of the processing when deciding which risk mitigation measures are appropriate, balancing them against what you need to fulfil your purposes. For example, if you intend to use synthetic data in a secure setting (eg a trusted research environment), the risk of attack may be reduced, and you could consider less stringent measures (eg, when using differential privacy, adding less noise than if you were releasing the information to the world at large). In some cases, you will not be able to achieve an acceptable balance of utility and protection; for example, if you need to capture outliers as part of your purposes for fraud detection. If you do not have the required expertise, you should consult an external expert when setting an appropriate privacy budget for differential privacy. This will help you achieve the best trade-off between protection and utility for your purposes.
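To illustrate the role of the privacy budget, here is a hedged sketch of one well-known construction (the Laplace mechanism applied to a histogram; this is not the only way to combine differential privacy with synthetic data). The parameter epsilon is the privacy budget: a smaller epsilon adds more noise, giving stronger protection but lower utility, which is why a secure setting may justify a larger epsilon than a public release.

```python
# Hedged sketch: epsilon-differentially-private synthetic sampling via
# the Laplace mechanism on a histogram.
import numpy as np

rng = np.random.default_rng(2)

def dp_synthetic_sample(values: np.ndarray, epsilon: float,
                        n: int) -> np.ndarray:
    """Sample n synthetic values from an epsilon-DP histogram of `values`."""
    categories = np.unique(values)
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)
    # Adding or removing one person changes one count by at most 1,
    # so the L1 sensitivity of the histogram query is 1.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(categories))
    noisy = np.clip(noisy, 0.0, None)   # post-processing preserves DP
    return rng.choice(categories, size=n, p=noisy / noisy.sum())

causes = np.array(["fall", "chest pain", "fracture", "overdose"] * 250)
# A secure setting might justify a larger epsilon (less noise) than a
# public release; choosing the value is a judgement call for an expert.
print(dp_synthetic_sample(causes, epsilon=1.0, n=10))
```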

Further reading

The links below provide useful reading on synthetic data techniques and their associated benefits and risks.

The ONS has proposed a high-level scale for evaluating synthetic data based on how closely it resembles the original data, its purpose and its disclosure risk.

For an evaluation of how synthetic data delivers utility, see Manchester University’s publication “A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records” (external link, PDF).

For more information on how synthetic data can be combined with differential privacy, see “Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe” (external link, PDF).

“Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy” (external link, PDF) discusses the key requirements for synthetic data to validate and benchmark machine learning algorithms, as well as to reveal any biases in real-world data used for algorithm development.