This use case study is without prejudice to laws and regulatory practices in any jurisdiction. In addition to the steps described in this case study, other steps may be required to meet national data protection laws, other legislation, and healthcare laws and practices. For example, this case study does not include an ethics review, which may be a requirement for re-use of health data in some jurisdictions.
At a glance
- This hypothetical use case aims to demonstrate how synthetic data, generated from a real medical prescription dataset, can be used as part of a testing strategy for the development of a healthcare planning and resource allocation system, without jeopardising the privacy of real patients.
- Fictitious (though realistic) parameters were chosen to ensure that the case study is comparable to a real-world synthetic data use case.
In detail
- This case study explains how synthetic datasets generated from prescription databases can be used to develop and test a system for planning public services, without the need to share sensitive patient information with the parties involved in developing the service.
- The synthetic data provides detailed insights into the specific healthcare needs of different regions. This allows the development of a system which enables planners and decision-makers to be more targeted, which can help the national healthcare system to more effectively meet local needs. The synthetic data provides the level of detail required for the development of a tailored system, without sharing real patient data outside of the healthcare system.
- The case study provides information about how such a process can be carried out and recommended technical and organisational measures.
In this case study, a health authority wants to develop a system with the following functionality:
- Identification of regions with high prescription rates for specific conditions that might be under-served by existing healthcare infrastructure;
- Analysing trends for specific regions or conditions, for various purposes including resource allocation, efficiency in service delivery and helping policy-makers in budget allocation;
- Helping to plan the allocation and readiness of emergency services;
- Identifying where new healthcare facilities or services are most needed; and
- Developing long-term public health strategies, such as preventative programmes.
Before deploying a national system involving real patient data, the healthcare authorities will design and test a small-scale prototype system using synthetic data to identify any problems, data handling issues, or inaccuracies in data interpretation. This approach helps to speed up the development cycle and improve the functionality of the system. Furthermore, synthetic data will:
- be used to replicate rare but plausible scenarios which the system must be able to handle. Using synthetic data is useful because there may be insufficient real-world data available to adequately test these scenarios. This is done by oversampling rare instances so they scale to the size of the synthetic dataset.
- replicate the granularity of real-world prescription data. This allows for a more realistic test for the deeper, more nuanced analysis that the system will perform. This will lead to more accurate results than the alternative of using aggregated data which may overlook details that are necessary for understanding underlying trends.
- demonstrate that system functionality works as expected. This helps to gain stakeholder trust before full-scale implementation of the planning system.
- facilitate faster innovation as the healthcare authority can start developing the system and determine the appropriate governance for using real data by first testing its effectiveness on the synthetic data.
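The oversampling of rare instances mentioned above can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries and simply resamples rare categories with replacement until each reaches a minimum count. The function name `oversample_rare`, the `min_count` parameter and the field names are hypothetical and are not part of the case study.

```python
import random
from collections import Counter


def oversample_rare(records, key, min_count, seed=0):
    """Resample rare categories (with replacement) up to `min_count` records each."""
    rng = random.Random(seed)
    counts = Counter(r[key] for r in records)
    result = list(records)
    for category, n in counts.items():
        if n < min_count:
            pool = [r for r in records if r[key] == category]
            # Draw extra copies only for the under-represented category.
            result.extend(rng.choice(pool) for _ in range(min_count - n))
    return result


# Illustrative data: one common and one rare (hypothetical) condition.
records = [{"condition": "common"}] * 98 + [{"condition": "rare"}] * 2
boosted = oversample_rare(records, "condition", min_count=10)
print(Counter(r["condition"] for r in boosted))
```

In practice a generative model might instead be asked to produce more samples conditioned on the rare category; duplicating records is only the simplest way to show the idea.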
Context
The scenario involves the following parties:
- Regional branches of the national health authority: use their database of prescribed medicines relevant to their own geographic areas within the country to generate the synthetic data, which they then make available to the third-party development team for synthesis of the national data pooled from all regional branches. The health authority has its own in-house analytics team to perform the data synthesis tasks.
- Software testing to determine whether all the features required for data generation, e.g. handling different data types and supporting various scenarios, are present in the synthetic data generation software and working as expected.
- Validation to ensure that the system used to process synthetic data operates correctly and efficiently. For the synthetic data generation software, this means separate tests to verify that the synthetic data generation algorithms correctly replicate the statistical properties of the original prescriptions data.
- Identifiability testing to ensure that no individuals can be identified from the synthetic data.
- Third-party solutions provider: use the synthetic data to design and test the prototype public health planning system. They will test functionality such as allocation of medical resources and service delivery planning. The testing process involves performance testing and testing of the analytics performed on the synthetic data.
- Performance testing to assess the system used to process synthetic data. For example, ensuring the test system software is computationally efficient and can handle increasing amounts of data without significant performance degradation.
The prescription datasets held by each regional branch of the health authority contain information about prescription drugs dispensed within their respective region. After pooling, the combined dataset holds information on approximately 60 million people over 700 million rows. It holds the following types of data:
- Prescription Details: These include the month and year of the prescription, medication name, dosage, frequency, and duration.
- Pharmacy Location: the geographic area where the prescription was filled. A unique alphanumeric code is used to represent each geographic area.
- Health Condition: Basic information about the disease or condition for which the medication is prescribed, e.g. Type 2 diabetes.
- Basic demographic information: Basic information regarding the patient. This includes age range, sex and gender of the patient, and geographic area of residence.
The flowchart below shows the process for generating synthetic data, including the step by which prescription data is processed by the health authority to generate area-level synthetic data. The data is prepared and the area-level data is combined into a single dataset before being synthesised by the national health authority. A centralised approach was used because synthesising from a pooled dataset of prescriptions data provides much better utility for the purposes than a federated approach of pooling individually synthesised datasets into a national dataset. Once synthesised, the data is made available to the third-party development team for development and testing purposes.
Examples
Analysis of the prescription dataset may reveal information about patterns in prescription rates amongst certain groups. For example:
- An above-average prescription rate for Type 2 diabetes medication in a specific geographic area among the age group 50-60, leading to increased resource allocation.
- A lower level of prescription for high blood pressure medication in 50-70 year olds in a different geographic area, leading to research into possible environmental factors that reduce disease.
- An above-average prescription rate of psychiatric medication, leading to increased provision of mental health facilities in that area.
- National trends in prescribing specific medication for the purpose of epidemiological surveillance. For example, monitoring levels of influenza at a national-level.
Generating synthetic data allows the systems that can generate these insights to be developed and tested without using real patient data, which helps to preserve patient privacy.
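As an illustration of the kind of analysis listed in the examples, the following minimal Python sketch flags geographic areas whose prescription rate for a condition is above the average across areas. The record fields (`area_code`, `condition`), the population figures and the function name are hypothetical; a real planning system would use more careful statistics (for example age standardisation).

```python
from collections import Counter


def areas_above_average(prescriptions, population_by_area, condition):
    """Return area codes whose per-capita prescription rate exceeds the mean rate."""
    counts = Counter(
        p["area_code"] for p in prescriptions if p["condition"] == condition
    )
    rates = {area: counts[area] / pop for area, pop in population_by_area.items()}
    mean_rate = sum(rates.values()) / len(rates)
    return sorted(area for area, rate in rates.items() if rate > mean_rate)


# Illustrative (synthetic) records and populations.
prescriptions = (
    [{"area_code": "A", "condition": "Type 2 diabetes"}] * 3
    + [{"area_code": "B", "condition": "Type 2 diabetes"}]
    + [{"area_code": "C", "condition": "Type 2 diabetes"}]
)
population = {"A": 100, "B": 100, "C": 100}
flagged = areas_above_average(prescriptions, population, "Type 2 diabetes")
print(flagged)
```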
Challenges of this case study
- Reducing the need to access or share personal data of real patients and at the same time preserving the statistical utility of the data (i.e. finding the right balance between privacy and utility);
- Identification of real patients can have a significant impact on them, and also on the trust they place in health authorities to handle their data;
- A high level of expertise in different topics may be needed:
- Health knowledge;
- Statistics and data analytics expertise;
- Artificial intelligence and synthetic dataset generation;
- Privacy technology expertise.
Data processing phases and technical measures
- In this case, the national health authority prepares and analyses the data before they derive the synthetic dataset from the database containing prescription data, as they have the required expertise ‘in house’. 1
- The first step is to assess whether the prescription data accurately reflects the health conditions of the population. This will involve several steps, including validation with external data on health statistics.
1. Data preparation phase
Before generating the synthetic data, the in-house data analytics team at the national healthcare authority:
- Removes irrelevant attributes in the prescriptions dataset, for example, time of prescription and direct identifiers.
- Standardises the data to ensure that all of the fields are using consistent coding schemes.
- Sanitises the data to remove direct identifiers to reduce the identifiability of the patients.
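The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field names in `DROP_FIELDS`, the `SEX_CODES` mapping and the `prepare_record` function are hypothetical and do not describe the health authority's actual schema or coding schemes.

```python
# Hypothetical attributes to drop: direct identifiers and irrelevant fields.
DROP_FIELDS = {"patient_name", "patient_id", "prescription_time"}

# Hypothetical mapping that standardises free-text values to one coding scheme.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def prepare_record(raw):
    """Drop unwanted attributes and standardise the remaining coded fields."""
    record = {k: v for k, v in raw.items() if k not in DROP_FIELDS}
    if "sex" in record:
        key = str(record["sex"]).strip().lower()
        record["sex"] = SEX_CODES.get(key, "UNKNOWN")
    if "area_code" in record:
        record["area_code"] = str(record["area_code"]).strip().upper()
    return record


raw = {
    "patient_name": "Example Person",
    "prescription_time": "09:30",
    "sex": " Female ",
    "area_code": " ab12 ",
    "medication": "metformin",
}
print(prepare_record(raw))
```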
2. Data analysis phase
The overall goal of the data analysis phase for the health authority is to thoroughly understand the structure, relationships, and key statistical properties of the prescription dataset. This understanding is critical for generating high-quality synthetic data that accurately represents the original data while minimising the identifiability of the patients.
3. Synthetic Data Generation phase
The provider needs to choose an appropriate technique for generating the synthetic data. Because the dataset contains a mix of categorical and numerical data, the health authority needs to consider methods that can handle this diversity.
A data model or dictionary is used by the health authority to understand the structure and characteristics of the dataset, which in turn helps in determining the best approach to generating synthetic data. The data model/dictionary outlines the structure of the data, including tables, columns, data types (e.g. numerical, categorical), and the relationships between different entities (e.g. prescriptions and health conditions). Additionally, it includes the following features:
- Constraints like primary and foreign keys, which ensure referential integrity between different data entities.
- Valid ranges for numerical data (e.g. dosage amounts) and possible categories for categorical data (e.g. types of health conditions).
- Additional metadata, such as the frequency of missing values and typical data distributions, in order to assess the quality of the data and for governance and management purposes.
Based on the data analysis and purposes, the health authority considers the available synthesis methods, including VAEs, GANs and Bayesian Networks. 2
Each method is assessed for suitability for the data types and relationships in the prescription data. For example, GANs are suitable for complex, high-dimensional data, while Bayesian Networks are also useful for modelling the complexity and dimensionality of healthcare datasets.
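To make the data dictionary described above more concrete, the sketch below models column specifications with valid ranges and permitted categories, and checks a record against them. The `ColumnSpec` class, the `violations` function, the column names and the range values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    kind: str  # "numerical" or "categorical"
    valid_range: Optional[Tuple[float, float]] = None
    categories: Optional[FrozenSet[str]] = None


def violations(record, specs):
    """List the ways a record breaks the data dictionary (empty list if none)."""
    problems = []
    for spec in specs:
        value = record.get(spec.name)
        if value is None:
            problems.append(f"{spec.name}: missing")
        elif spec.kind == "numerical":
            low, high = spec.valid_range
            if not low <= value <= high:
                problems.append(f"{spec.name}: {value} outside [{low}, {high}]")
        elif value not in spec.categories:
            problems.append(f"{spec.name}: unknown category {value!r}")
    return problems


# Hypothetical dictionary entries.
specs = [
    ColumnSpec("dosage_mg", "numerical", valid_range=(0, 2000)),
    ColumnSpec("condition", "categorical",
               categories=frozenset({"Type 2 diabetes", "hypertension"})),
]
print(violations({"dosage_mg": 5000, "condition": "Type 2 diabetes"}, specs))
```

The same specification can be reused later to screen generated records for values outside the documented ranges.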
Once a suitable method has been chosen, and prior to model training, the data is split into a training set (70%) and a validation set (30%).
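A minimal sketch of such a 70/30 split is shown below. It assumes the records fit in memory and that a simple random shuffle is acceptable; the function name and the fixed seed are illustrative choices, and a real pipeline might need to split by patient so that one person's records do not appear in both sets.

```python
import random


def train_validation_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


train, validation = train_validation_split(range(100))
print(len(train), len(validation))
```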
The purposes require that the risk of identifying any individuals from the original prescriptions data is negligible, therefore the provider decides to also implement Differential Privacy (DP) as an additional technical measure. In this case study, noise is applied to ensure that the model outputs are differentially private. The DP approach must be carefully calibrated to preserve the utility of the synthetic data for its intended purposes while ensuring the risk of identifying any of the individuals from the prescriptions dataset is mitigated.
If neural network-based models are used (e.g. GANs) to generate synthetic data, noise is added to the gradients during model training. This helps ensure the model does not learn to reproduce exact details that could lead to re-identification of individuals, providing the privacy budget (epsilon value) is appropriately set for the use case.
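The idea of adding noise to gradients can be sketched as follows. This is a simplified illustration of the widely used approach of clipping each per-example gradient and adding Gaussian noise (often called DP-SGD), not the health authority's actual implementation. The function name and the parameters `clip_norm` and `noise_multiplier` are assumptions, and the sketch does not compute the resulting epsilon, which requires a separate privacy accounting step.

```python
import math
import random


def privatise_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient in L2 norm, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


rng = random.Random(0)
noisy = privatise_gradients([[3.0, 4.0], [0.1, 0.2]], clip_norm=1.0,
                            noise_multiplier=1.1, rng=rng)
print(noisy)
```

Clipping bounds the influence any single record can have on an update, and the noise then masks that bounded influence.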
If other models are used, noise is added to the objective function to help ensure the model learns generalised patterns rather than information on individuals.
If a machine learning approach is used and the training is allowed to proceed for too long, then the model may overfit to the training data. Overfitting could lead the model to reproduce specific data points rather than general patterns, which might increase the risk of leaking identifiable information. Validation techniques are critical to ensure that the model is generalising well and not overfitting to the training data. For example, the health authority performs early stopping. This involves monitoring the model's performance on a validation set during training. If the performance on the validation set starts to deteriorate (indicating that the model may be starting to overfit to the training data), they stop training.
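Early stopping as described above can be sketched with a simple patience rule. The example assumes that a validation loss is recorded once per epoch and that training stops after a fixed number of epochs without improvement; the function name and the `patience` parameter are illustrative.

```python
def early_stopping(validation_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop: validation loss keeps deteriorating
    return len(validation_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.8, 0.9, 0.6]
print(early_stopping(losses, patience=3))
```

The model saved at `best_epoch` would typically be the one kept for generating synthetic data.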
The health authority also uses distance metrics 3 to quantify the difference between the distributions of the real and synthetic data. Overfitting can be detected in this way (e.g. when comparisons yield extremely small distances), especially if anomalies or outliers from the training dataset are replicated.
In this case study, the provider is able to strike an appropriate balance to ensure the synthetic data is still useful for its intended purposes. The level of noise added is based on privacy considerations such as the sensitivity of the data and checked against factors such as the required fidelity of the synthetic data for the purposes. The goal is to produce synthetic data that is representative enough for meaningful analysis but altered sufficiently to minimise the risks of re-identification.
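As a small illustration of one such distance, the sketch below computes the Earth Mover's Distance between two one-dimensional samples of equal size, which in that special case equals the mean absolute difference between the sorted values. The function name and the example values are hypothetical; multi-dimensional or unequal-size comparisons need a more general method.

```python
def earth_movers_distance_1d(real, synthetic):
    """Earth Mover's Distance for two equal-size 1-D samples."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


# Hypothetical dosage values (mg) from real and synthetic records.
real_dosage = [250, 500, 500, 1000]
synthetic_dosage = [250, 500, 750, 1000]
print(earth_movers_distance_1d(real_dosage, synthetic_dosage))
```

A distance of exactly zero across many attributes can be a warning sign that the generator is copying training records rather than learning general patterns.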
4. Synthetic Data validation phase:
Utility assessment
After generating the synthetic data, its utility is assessed to ensure it fulfils the purposes. The health authority compares the synthetic dataset with the original dataset to ensure that the synthetic data reflects the statistical properties of the original data and that no real patient’s data is replicated in the dataset. This involves statistical comparisons with the real data to ensure key patterns and distributions are preserved, e.g. health conditions, prescribed medication and age ranges.
The health authority queries both the real dataset and the validation set of the synthetic dataset with the same set of random three-way marginal queries (e.g. counting the number of people with a specific age, medical condition, and medication) and then calculates the Mean Relative Error (MRE) between the counts from the real and synthetic data. MRE provides a quantitative measure of how well the synthetic data captures the joint distribution of multiple attributes. It is particularly relevant for assessing complex interactions and co-occurrences in data, which are important for this use case.
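The MRE calculation over random three-way marginal queries can be sketched as below. Several choices here are assumptions made for illustration: queries are built by picking three attributes and copying their values from a randomly chosen real record, the real and synthetic datasets are assumed to be the same size, and the denominator is floored at 1 to avoid division by zero. All function and field names are hypothetical.

```python
import random


def count_matching(records, query):
    """Count records whose attributes equal every value in the query."""
    return sum(all(r[k] == v for k, v in query.items()) for r in records)


def random_three_way_queries(real, attrs, n_queries, seed=0):
    """Build queries from three random attributes of randomly chosen real records."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n_queries):
        template = rng.choice(real)
        queries.append({a: template[a] for a in rng.sample(attrs, 3)})
    return queries


def mean_relative_error(real, synthetic, queries):
    """Average of |real count - synthetic count| / real count over all queries."""
    errors = []
    for query in queries:
        real_count = count_matching(real, query)
        synth_count = count_matching(synthetic, query)
        errors.append(abs(real_count - synth_count) / max(real_count, 1))
    return sum(errors) / len(errors)


real = [{"age": "50-59", "condition": "T2D", "medication": "metformin"}] * 4
synthetic = real[:3] + [{"age": "60-69", "condition": "T2D", "medication": "metformin"}]
queries = random_three_way_queries(real, ["age", "condition", "medication"], 5)
print(mean_relative_error(real, synthetic, queries))
```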
Further manual correction is carried out by the health authority to remove any instances of implausible records, for example a contraceptive pill prescribed to a male.
The health authority uses an open source package 4 for evaluating tabular synthetic data. This is used to compare synthetic data against the real data using a variety of metrics, for example:
- Marginal Distributions: These metrics compare the distributions of individual variables in the synthetic dataset to those in the original dataset. For prescription data, it is important that the distribution of medications, dosages, and patient demographics in the synthetic data closely matches the real data.
- Correlation Metrics: These assess whether the relationships (e.g., correlations) between different variables (like specific health conditions and prescriptions) are preserved in the synthetic data.
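To show what a marginal distribution comparison measures, the following standalone sketch computes the total variation distance between the category frequencies of one column in the real and synthetic data. It is written from scratch for illustration and is not the API of the open source package mentioned above; the function name and example values are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies (0 = identical)."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )


real_meds = ["metformin", "metformin", "insulin", "insulin"]
synthetic_meds = ["metformin", "insulin", "insulin", "insulin"]
print(total_variation_distance(real_meds, synthetic_meds))
```

Values close to 0 indicate that the synthetic column reproduces the real column's marginal distribution closely; values close to 1 indicate almost no overlap.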
Considerations related to the risk of re-identification
The health authority compares the synthetic dataset with the original dataset to ensure that no real patient’s data is replicated in the dataset. The healthcare authority then performs intruder testing using simulated attack scenarios to gauge the robustness of the synthetic data against potential re-identification attempts. This involves motivated intruder attacks and outlier detection to assess how feasible it is to associate a synthetic record with a real individual. 5
The health authority attempts to match the synthetic records to real individuals, e.g. trying to link the synthetic data with the prescriptions dataset, looking for unique patterns in the data, or identifying outlier records in the synthetic data that correspond to real individuals. If certain synthetic data points are too close to the original data, they are rejected.
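The rejection of synthetic records that are too close to real ones can be sketched with a distance-to-closest-record check. The example below assumes categorical attributes and uses the number of differing attributes (Hamming distance) as the distance; the threshold `min_distance`, the function names and the field names are hypothetical. On coarse attributes, exact matches can occur by chance, so in practice the threshold and attribute set need careful calibration.

```python
def hamming_distance(a, b, attrs):
    """Number of attributes on which two records differ."""
    return sum(a[k] != b[k] for k in attrs)


def reject_close_records(synthetic, real, attrs, min_distance=1):
    """Split synthetic records into (kept, rejected) by distance to the closest real record."""
    kept, rejected = [], []
    for s in synthetic:
        closest = min(hamming_distance(s, r, attrs) for r in real)
        (kept if closest >= min_distance else rejected).append(s)
    return kept, rejected


attrs = ["age", "condition", "area_code"]
real = [{"age": "50-59", "condition": "T2D", "area_code": "AB12"}]
synthetic = [
    {"age": "50-59", "condition": "T2D", "area_code": "AB12"},  # exact copy
    {"age": "60-69", "condition": "T2D", "area_code": "CD34"},
]
kept, rejected = reject_close_records(synthetic, real, attrs)
print(len(kept), len(rejected))
```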
The health authority evaluates the success of their attempts to re-identify individuals, based on whether it is possible to match a synthetic record to a real person or learn something new about them from the synthetic data.
After generating synthetic data, outlier detection is applied using appropriate statistical methods to ensure the synthetic dataset does not contain unrealistic data points or data points that represent rare edge cases or individuals. 6
The health authority also uses an external security testing company to perform additional adversarial attacks and benchmarks, for example:
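One statistical method that could be used for this step is the modified z-score, which is based on the median and the median absolute deviation and is therefore less affected by the outliers it is trying to find. The sketch below is an illustrative choice, not necessarily the method used by the health authority; the commonly cited threshold of 3.5 and the example values are assumptions.

```python
import statistics


def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("median absolute deviation is zero; scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    return [i for i, z in enumerate(modified_z_scores(values)) if abs(z) > threshold]


# Hypothetical prescription durations in days from synthetic records.
durations = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(flag_outliers(durations))
```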
- Membership inference attacks: These attacks aim to determine whether a particular data point was part of the training dataset.
- Attribute inference attacks: These attacks try to infer sensitive attributes of individuals in the training dataset based on the synthetic data and the knowledge of some publicly available information about a specific record relating to an individual.
- Model inversion attacks: These attacks attempt to reconstruct the training data from the model's parameters or outputs. This is done to help identify any potential weaknesses that could be exploited if information about the model were to be leaked intentionally or accidentally, as the data might be more vulnerable to model inversion attacks in the event of a leak.
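A very simple form of membership inference test is sketched below to show how such an attack can be scored. The attacker guesses that a record was in the training data if a synthetic record lies within a distance threshold of it, and the guess is compared against known members and non-members. This is a deliberately basic, hypothetical attack; real assessments use stronger attacks, and all names and values here are illustrative.

```python
def nearest_distance(target, synthetic, attrs):
    """Smallest number of differing attributes between the target and any synthetic record."""
    return min(sum(target[k] != s[k] for k in attrs) for s in synthetic)


def membership_attack_accuracy(members, non_members, synthetic, attrs, threshold=0):
    """Fraction of correct member / non-member guesses made by the distance rule."""
    correct = 0
    for target in members:
        correct += nearest_distance(target, synthetic, attrs) <= threshold
    for target in non_members:
        correct += nearest_distance(target, synthetic, attrs) > threshold
    return correct / (len(members) + len(non_members))


attrs = ["age", "condition"]
synthetic = [{"age": "50-59", "condition": "T2D"}]
members = [{"age": "50-59", "condition": "T2D"}]
non_members = [{"age": "20-29", "condition": "asthma"}]
print(membership_attack_accuracy(members, non_members, synthetic, attrs))
```

With equal numbers of members and non-members, an accuracy close to 0.5 suggests the attack performs no better than guessing, while values well above 0.5 indicate leakage.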
The health authority does not release the synthetic data generator publicly to mitigate the risk of it being reverse-engineered to compromise the data. Access to the generator is strictly limited to necessary staff only within the health authority. The healthcare authority does not give the third-party solutions provider access to the entire training data. Instead, they give the provider access to a subset of real prescription data to test the success of the attack.
Organisational measures
The healthcare authority strictly controls access to the original dataset. They establish a role-based access control system, ensuring that only individuals with relevant roles can access the data. They also have a comprehensive logging and monitoring system in place. This system records every access to and interaction with the data, creating an audit trail that can be reviewed in case of any security incidents.
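The combination of role-based access control and an audit trail can be sketched in a few lines. The roles, dataset names and the `request_access` function below are hypothetical, and a production system would rely on the organisation's identity management and tamper-resistant logging rather than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical roles and the datasets each role may read.
ROLE_PERMISSIONS = {
    "data_engineer": {"original_prescriptions"},
    "analyst": {"synthetic_prescriptions"},
}


def request_access(user, role, dataset, audit_log):
    """Return True if the role may read the dataset; record every attempt."""
    allowed = dataset in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


audit_log = []
print(request_access("alice", "analyst", "original_prescriptions", audit_log))
print(request_access("bob", "data_engineer", "original_prescriptions", audit_log))
```

Denied attempts are logged as well as granted ones, so the audit trail can support a later review of any security incident.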
Separation of duties is enforced between the staff who have access to the original identifiable prescriptions data and the staff involved in the data analysis phase to mitigate the risk of unauthorised re-identification. The risk of re-identification is much lower for the synthetic data, so the processes around sharing it are less strict, but security controls still apply.
The health authority also puts in place contractual agreements with both the third-party solutions provider and the external security testing company, which have access to the data. These agreements clearly state that any attempt to re-identify individuals is prohibited. All individuals within the external providers involved in the service must also undertake specific training in relation to handling data. This training covers data protection, information governance, and information security.
Ongoing monitoring of the system
After the system has been developed, full control is handed over to the health authority. The synthetic data is used to test the functionality of the system before real data is used with it. If new functionality is introduced over time, e.g. new variables are needed for analysis or the scope of the purposes changes, then the health authority may need to generate new synthetic data to test this functionality and reflect changes in the real data over time. In this case, it would be necessary to repeat the data collection and pooling, synthetic data generation, validation and testing phases as described in this case study.
How do the technical and organisational measures achieve the objective?
Implementing the technical and organisational measures described in this use case study allows the parties to confidently derive insights from the synthetic data while ensuring the risk to individuals is minimised.
Acknowledgements
This case study has been co-developed by the data protection authorities of the G7, as part of the G7 Emerging Technologies Working Group. Its development was led by the UK Information Commissioner's Office (ICO).
The G7 data protection authorities thank the following experts for their review and feedback on this case study:
- The IEEE Synthetic Data working group
- Professor Khaled El Emam, University of Ottawa, Canada
- Dr Jonathan Pearson, NHS England Digital Analytics and Research, UK
- Professor Pierre-Antoine Gourraud, Université de Nantes, France
- Dr Aurélien Bellet, National Institute for Research in Digital Science and Technology (Inria), France
1 Data synthesis algorithms are advanced techniques that require a good understanding of machine learning and the specific algorithms to use effectively. If internal expertise is not available then an external expert should be used to generate synthetic data. In this case, additional safeguards and obligations, such as contractual controls, and strong technical and organisational measures would be required.
2 VAEs are a type of machine learning model that can be used for synthetic data generation. They work by encoding the input data into a lower-dimensional space, and then decoding it to generate new data. GANs work via an adversarial process, in which two models are trained simultaneously: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than the generative model. A Bayesian network is a graphical model of the joint probability distribution for a set of variables and can be used to create synthetic data.
3 Such as Euclidean distance, Manhattan distance, or more complex measures like Earth Mover’s Distance (EMD).
4 Such as Synthetic Data Metrics.
5 Re-identification attacks against synthetic data are a relatively new area of research. Hence the assessment of whether synthetic data has residual risks of re-identification should be performed on a regular basis, taking into account further developments in that research.
6 See 1.3.5.17. Detection of Outliers (nist.gov) for a discussion of appropriate statistical methods.