Task 2: Collect and pre-process your data in an explanation-aware manner

Due to the Data (Use and Access) Act coming into law on 19 June 2025, this guidance is under review and may be subject to change. The Plans for new and updated guidance page will tell you about which guidance will be updated and when this will happen.

At a glance

The data that you collect and pre-process before inputting it into your system has an important role to play in your ability to derive each explanation type.
Careful labelling and selection of input data can help provide information for your rationale explanation.
To be more transparent you may wish to provide details about who is responsible at each stage of data collection and pre-processing. You could provide this as part of your responsibility explanation.
To aid your data explanation, you could include details on:
- the source of the training data;
- how it was collected;
- assessments about its quality; and
- steps taken to address quality issues, such as completing, augmenting, or removing data.
You should check the data used within your model to ensure it is sufficiently representative of those you are making decisions about. You should also consider whether pre-processing techniques, such as re-weighting, are required. These will help your fairness explanation.
You should ensure that the modelling, testing and monitoring stages of your system development lead to accurate results to aid your safety and performance explanation.
Documenting your impact and risk assessment, and steps taken throughout the model development to implement these assessments, will help in your impact explanation.

Checklist

☐ Our data is representative of those we make decisions about, reliable, relevant and up-to-date.

☐ We have checked with a domain expert to ensure that the data we are using is appropriate and adequate.

☐ We know where the data has come from, the purpose it was originally collected for, and how it was collected.

☐ Where we are using synthetic data, we know how it was created and what properties it has.

☐ We know what the risks are of using the data we have chosen, as well as the risks to data subjects of having their data included.

☐ We have labelled the data we are using in our AI system with information including what it is, where it is from, and the reasons why we have included it.

☐ Where we are using unstructured or high-dimensional data, we are clear about why we are doing this and the impact of this on explainability.

☐ We have ensured as far as possible that the data does not reflect past discrimination, whether based explicitly on protected characteristics or possible proxies.

☐ We have mitigated possible bias through pre-processing techniques such as re-weighting, up-weighting, masking, or excluding features and their proxies.

☐ It is clear who within our organisation is responsible for data collection and pre-processing.

In more detail

Introduction to collecting and pre-processing the data
Rationale explanation
Responsibility explanation
Data explanation
Fairness explanation
Safety and performance explanation
Impact explanation

Introduction

How you collect and pre-process the data you use in your chosen model has a bearing on the quality of the explanation you can offer to decision recipients. This task therefore emphasises some of the things you should think about when you are at these stages of your design process, and how this can contribute to the information you provide to individuals for each explanation type.

Rationale explanation

Understanding the logic of an AI model, or of a specific AI-assisted decision, is much simpler when the features (the input variables from which the model draws inferences and that influence a decision) are already interpretable by humans, for example someone’s age or location. Where you can, limit your pre-processing of that data so that extensive feature engineering doesn’t transform it into more abstract features that are difficult for humans to understand.

Careful, transparent, and well-informed data labelling practices will set up your AI model to be as interpretable as possible. If you are using data that is not already naturally labelled, there will be a stage at which humans label the data with relevant information. At this stage you should ensure that the information recorded is as rich and meaningful as possible. Ask those charged with labelling data not only to tag and annotate what a piece of data is, but also to record the reasons for that tag. For example, rather than ‘this x-ray contains a tumour’, say ‘this x-ray contains a tumour because…’. Then, when your AI system classifies new x-ray images as containing tumours, you will be able to look back to the labelling of the most similar examples from the training data to contribute towards your explanation of the decision rationale.
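As a minimal sketch of this labelling practice, the Python below stores each label together with the annotator's reason and then retrieves the most similar training examples for a new input. The record fields, the two-number feature vectors, the identifiers and the use of Euclidean distance are all assumptions made for illustration; they are not part of this guidance, and a real system would use features and a similarity measure suited to its data.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class LabelledExample:
    example_id: str
    features: list[float]  # hypothetical numeric features derived from the image
    label: str             # what the annotator says the data is
    reason: str            # why the annotator chose that label
    annotator: str         # who applied the label


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query, training_set, k=2):
    """Return the k training examples closest to the query features."""
    return sorted(training_set, key=lambda ex: euclidean(query, ex.features))[:k]


# Made-up records, for illustration only.
training_set = [
    LabelledExample("xr-001", [0.9, 0.8], "tumour",
                    "dense mass with irregular margins", "radiologist-a"),
    LabelledExample("xr-002", [0.1, 0.2], "no tumour",
                    "no mass visible; tissue density uniform", "radiologist-b"),
    LabelledExample("xr-003", [0.7, 0.9], "tumour",
                    "spiculated lesion in upper lobe", "radiologist-a"),
]

neighbours = most_similar([0.85, 0.85], training_set, k=2)
for ex in neighbours:
    print(f"{ex.example_id}: labelled '{ex.label}' because {ex.reason}")
```

The recorded reasons of the nearest examples can then feed into the rationale explanation for the new classification, alongside whatever explanation method you select in the next task.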

Of course, all of this isn’t always possible. The domain in which you wish to use AI systems may require the collection and use of unstructured, high-dimensional data (where there are countless different input variables interacting with each other in complex ways).

In these cases, you should justify and document the need to use such data. You should also use the guidance in the next task to assess how best to obtain an explanation of the rationale through appropriate model selection and approaches to explanation extraction.

Responsibility explanation

Responsibility explanations are about telling people who, or which part of your organisation, is responsible for overall management of the AI model. This is primarily to make your organisation more accountable to the individuals it makes AI-assisted decisions about.

But you may also want to use this as an opportunity to be more transparent with people about which parts of your organisation are responsible for each stage of the development and deployment process, including data collection and preparation.

Of course, it may not be feasible for your customers to have direct contact with these parts of your organisation (depending on your organisation’s size and how you interact with customers). But telling people about the different business functions involved will give them a better understanding of the process. This may increase their trust and confidence in your use of AI-assisted decisions because you are being open and informative about the whole process.

If you are adopting a layered approach to the delivery of explanations, it is likely that this information will sit more comfortably in the second or third layer – where interested individuals can access it, without overloading others with too much information. See Task 6 for more on layering explanations.

Data explanation

The data explanation is, in part, a catch-all for giving people information about the data used to train your AI model.

There is therefore a lot of overlap with information you may already have included about data collection and preparation in your rationale, fairness, and safety and performance explanations.

However, there are other aspects of the data collection and preparation stage that you could also include. For example:

- the source of the training data;
- how it was collected;
- assessments about its quality; and
- steps taken to address quality issues, such as completing or removing data.

While these may be more procedural (less directly linked to key areas of interest such as fairness and accuracy) there is still value in providing this information. As with the responsibility explanation, the more insight individuals have on the AI model that makes decisions about them, the more confident they are likely to be in interacting with these systems and trusting your use of them.
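One way to keep this information ready for a data explanation is to record it in a structured form next to the dataset. The sketch below is an assumption about how such a record could look: the class names, field names and example values are hypothetical and are not prescribed by this guidance.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QualityStep:
    issue: str   # the quality problem that was found
    action: str  # what was done, e.g. "completed", "augmented" or "removed"


@dataclass
class DatasetRecord:
    name: str
    source: str              # where the training data came from
    collection_method: str   # how it was collected
    original_purpose: str    # the purpose it was originally collected for
    quality_assessment: str  # summary of the quality assessment
    quality_steps: list[QualityStep] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the record so it can be stored or published."""
        return json.dumps(asdict(self), indent=2)


# Made-up values, for illustration only.
record = DatasetRecord(
    name="loan-applications-2019",
    source="internal application forms",
    collection_method="online form submitted by applicants",
    original_purpose="assessing loan applications",
    quality_assessment="3% of records had a missing income field",
    quality_steps=[QualityStep("missing income field", "removed")],
)

record_json = record.to_json()
print(record_json)
```

Keeping the record machine-readable means the same details can be reused in your responsibility and fairness explanations without being rewritten by hand.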

Fairness explanation

Fairness explanations are about giving people information on the steps you’ve taken to mitigate risks of discrimination both in the production and implementation of your AI system and in the results it generates. They shed light on how individuals have been treated in comparison to others. Some of the most important steps to mitigate discrimination and bias arise at the data collection stage.

For example, when you collect data, you should have a domain expert assess whether it is sufficiently representative of the people you will make AI-assisted decisions about.
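A simple numerical check can support, but not replace, the domain expert's assessment. The sketch below compares the share of each group in a dataset with its share in a reference population and flags large differences. The function name, the age bands, the reference shares and the 5 percentage point tolerance are all assumptions chosen for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Return groups whose share in the sample differs from the reference
    share by more than `tolerance` (an assumed, adjustable threshold)."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = {"observed": observed, "expected": expected}
    return gaps


# Made-up data: the age band of each record in the training set.
sample_groups = ["18-34"] * 50 + ["35-64"] * 48 + ["65+"] * 2
# Made-up shares of each band among the people decisions will be made about.
reference_shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

gaps = representation_gaps(sample_groups, reference_shares)
for group, shares in gaps.items():
    print(group, shares)
```

In this made-up example the youngest band is over-represented and the oldest is under-represented, which is the kind of finding to raise with the domain expert and to document for your fairness explanation.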

You should also consider where the data came from, and assess to what extent it reflects past discrimination, whether based explicitly on protected characteristics such as race, or on possible proxies such as post code. You may need to modify the data to avoid your AI model learning and entrenching this bias in its decisions. You may use pre-processing techniques such as re-weighting, up-weighting, masking, or even excluding features to mitigate implicit discrimination in the dataset and to prevent bias from entering into the training process. If you exclude features, you should also ensure that you exclude proxies or related features.
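To illustrate re-weighting, the sketch below uses one common scheme, sometimes called reweighing, in which each record gets the weight P(group) × P(label) / P(group, label). This guidance does not prescribe a particular scheme, so treat the choice as an assumption; the function names, group names and labels are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each record by P(group) * P(label) / P(group, label), so that
    group and label are statistically independent in the weighted data."""
    n = len(groups)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(group, groups, labels, weights):
    """Share of a group's total weight that falls on positive (1) labels."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Made-up data: group "a" has 4 of 6 positive labels, group "b" has 1 of 4.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate("a", groups, labels, weights)
rate_b = weighted_positive_rate("b", groups, labels, weights)
print(rate_a, rate_b)
```

After weighting, both groups have approximately the same positive rate as the dataset overall. The weights would then be passed to a training procedure that accepts per-record weights. Re-weighting does not remove proxies, so the check on related features described above still applies.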

Considerations and actions such as these, that you take at the data collection and preparation stages, should feed directly into your fairness explanations. Ensure that you appropriately document what you do at these early stages so you can reflect this in your explanation.

Safety and performance explanation

The safety and performance explanation is concerned with the actions and measures you take to ensure that your AI system is accurate, secure, reliable and robust.

The accuracy component of this explanation is mainly about the actions and measures you take at the modelling, testing, and monitoring stages of developing an AI model. It involves providing people with information about the performance metrics chosen for a model, and about the various performance related measures you used to ensure optimal results.
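As an illustration of the kind of performance metrics you might report, the sketch below computes accuracy, precision and recall for a binary classifier from its predictions on held-out test data. The choice of these three metrics, the function names and the data are assumptions; the right metrics depend on your use case.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true
    negatives for binary labels encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def performance_summary(y_true, y_pred):
    """Return accuracy, precision and recall; None where undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Made-up test labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

summary = performance_summary(y_true, y_pred)
print(summary)
```

Recording which metrics you chose, and why, at the testing stage gives you the material for the accuracy component of this explanation.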

Impact explanation

The impact explanation involves telling people about how an AI model, and the decisions it makes, may impact them as individuals, communities, and members of wider society. It involves making decision recipients aware of what the possible positive and negative effects of an AI model’s outcomes are for people taken individually and as a whole. It also involves demonstrating that you have put appropriate forethought into mitigating any potential harm and pursuing any potential societal benefits.

Information on this will come from considerations you made as part of your impact and risk assessment (eg a data protection impact assessment), and from the practical measures you took throughout the development and deployment of the AI model to act on the outcome of the impact assessment.

This includes what you do at the data collection and preparation stage to mitigate risks of negative impact and amplify the possibility of positive impact on society.

You may have covered such steps in your fairness and safety and performance explanations (eg ensuring the collection of representative and up-to-date datasets). However, the impact explanation type is a good opportunity to clarify in simple terms how this affects the impact on society (eg by reducing the likelihood of systematic disadvantaging of minority groups, or improving the consistency of decision-making for all groups).

Example method for pre-processing data

A provenance-based approach to pre-processing data

One approach to pre-processing data for the purposes of explanation is based on provenance. All the information, data dependencies and processes underpinning a decision are collectively known as the provenance of the decision. The PROV data model [PROV-DM] is a vocabulary for provenance, which was standardised at the World Wide Web Consortium. Organisations can use PROV to uniformly represent and link relevant information about the processes running around the AI model, and to seamlessly query it, in order to construct relevant explanations. In addition, for organisations that depend on external data for their decisions, PROV allows for the provenance of data to be linked up across organisation boundaries.

PROV is a standard vocabulary to encode provenance – a form of knowledge graph providing an account of what a system performed, including references to: people, data sets, and organisations involved in decisions; attribution of data; and data derivations. The provenance of a decision enables you to trace back an AI decision to its input data, and to identify the responsibility for each of the activities found along the way. It allows for an explicit record of where data comes from, who in the organisation was associated with data collection and processing, and what data was used to train the AI system. Such provenance information provides the foundations to generate explanations for an AI decision, as well as for making the processes that surround an AI decision model more transparent and accountable.

When PROV is adopted as a way of uniformly encoding the provenance of a decision within or across organisations, it becomes possible to perform a range of tasks. This includes being able to computationally query the knowledge graph capturing the information, data dependencies and processes underpinning the decision. You can then extract the relevant information to construct the desired explanation. Therefore, the approach will help automate the process of extracting explanations about the pipeline around an AI model. These include explanations of the processes that led to the decision being made, who was responsible for each step in these processes, whether the AI model was solely responsible for the decision, and what data from which source influenced the decision. However, no work has yet addressed building explanations for the AI model itself from provenance, so you will need to couple it with another approach (such as the ones presented in Task 3).

This technique can be used to help provide information for the data explanation. It can also provide details for the responsibility, and safety and performance explanations.

There is an online demonstrator illustrating the provenance-based approach described above using a loan decision scenario at: https://explain.openprovenance.org/loan/.
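The kind of query described in this section can be sketched in plain Python. The example below is not the demonstrator's implementation and does not use a PROV library. It borrows five relation names from PROV-DM (wasGeneratedBy, used, wasDerivedFrom, wasAssociatedWith, wasAttributedTo), while every identifier, the `data:` prefix convention and the `trace` function are made up for illustration.

```python
from collections import deque

# (subject, relation, object) triples; identifiers are hypothetical.
RELATIONS = [
    ("decision:loan-42", "wasGeneratedBy", "activity:scoring"),
    ("activity:scoring", "used", "model:credit-v1"),
    ("activity:scoring", "used", "data:application-42"),
    ("activity:scoring", "wasAssociatedWith", "agent:decision-team"),
    ("model:credit-v1", "wasGeneratedBy", "activity:training"),
    ("activity:training", "used", "data:training-set"),
    ("activity:training", "wasAssociatedWith", "agent:data-science-team"),
    ("data:training-set", "wasDerivedFrom", "data:bureau-extract"),
    ("data:bureau-extract", "wasAttributedTo", "agent:credit-agency"),
]

LINEAGE = {"wasGeneratedBy", "used", "wasDerivedFrom"}
RESPONSIBILITY = {"wasAssociatedWith", "wasAttributedTo"}


def trace(start, relations):
    """Walk back from `start` along lineage relations (breadth-first).
    Returns the data items reached and the agents linked to each node."""
    inputs, responsible = [], {}
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for subject, relation, obj in relations:
            if subject != node:
                continue
            if relation in RESPONSIBILITY:
                responsible.setdefault(node, []).append(obj)
            elif relation in LINEAGE and obj not in seen:
                seen.add(obj)
                queue.append(obj)
                if obj.startswith("data:"):  # assumed naming convention
                    inputs.append(obj)
    return inputs, responsible


inputs, responsible = trace("decision:loan-42", RELATIONS)
print("Data behind the decision:", inputs)
print("Responsible agents:", responsible)
```

The two results map onto the data explanation (which data influenced the decision, and where it came from) and the responsibility explanation (who was associated with each step). A production system would more likely store the graph in a PROV-compliant format and query it with tooling built for that purpose.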

Further reading

For an introduction to the explanation types, see ‘The basics of explaining AI’.

For further details on how to take measures to ensure these kinds of fairness in practice and across your AI system’s design and deployment, see the fairness section of Understanding Artificial Intelligence Ethics and Safety, guidance produced by the Office for AI, the Government Digital Service, and The Alan Turing Institute.

You can read more about the provenance-based approach to pre-processing data at https://explain.openprovenance.org/report/.