This section is aimed at technical specialists and those in compliance-focused roles, and is intended to help them understand the risks associated with data minimisation in AI.

Control measure: There is a review of the relevance of personal information at each stage of system development and training before go-live, including a detailed justification for retaining the information and confirmation that irrelevant information has been removed or deleted.

Risk: Without appropriate reviews being undertaken at each stage, there is a risk of inappropriate retention of information. This may breach UK GDPR article 5(1)(c).

Ways to meet our expectations:

  • Assess whether the features used to train the AI system are adequate, relevant and limited to what is necessary for the purpose. Ensure the system is designed so that only this information is processed.
  • Consider applying privacy-preserving techniques, such as differential privacy, homomorphic encryption or federated learning, where resources allow. Federated learning, for example, allows a model to be trained across decentralised information sources without exposing raw, sensitive information (a minimal differential privacy sketch follows this list).
  • Ensure the training data can be modified to reduce the extent to which it can be traced back to specific people (see the pseudonymisation sketch after this list).
  • Include specific review phases in development plans to check information is minimised and not retained when no longer needed.
  • Include a justification in the DPIA for retaining information, where applicable.
  • Confirm that inadequate and irrelevant information is removed or deleted during the system development phase.
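
A minimal sketch of the core idea behind differential privacy, one of the techniques mentioned above, using the Laplace mechanism to release a noisy aggregate rather than a raw value. The count, sensitivity and epsilon values are illustrative only; training a model under differential privacy would instead use a vetted library and a technique such as differentially private gradient descent.

    import numpy as np

    def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
        # Noise scale grows with the query's sensitivity and shrinks as the
        # privacy budget (epsilon) grows; smaller epsilon = stronger privacy.
        scale = sensitivity / epsilon
        return true_value + np.random.laplace(loc=0.0, scale=scale)

    # Hypothetical example: release a noisy count of people in a training set.
    # A counting query has sensitivity 1: one person changes it by at most 1.
    true_count = 4213
    noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
    print(f"Noisy count released for analysis: {noisy_count:.0f}")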
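
And a minimal pseudonymisation sketch, assuming salted hashing of direct identifiers is appropriate for your context; the field names are hypothetical. Note that pseudonymised data is still personal information under UK GDPR: this reduces traceability but does not anonymise.

    import hashlib
    import secrets

    # The salt is generated once and must be stored separately under strict
    # access control; it makes dictionary attacks on known identifiers harder.
    SALT = secrets.token_bytes(16)  # in practice, manage this in a key store

    def pseudonymise(identifier: str) -> str:
        # Replace a direct identifier with a salted SHA-256 hash.
        return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

    record = {"customer_id": "CUST-00123", "postcode": "SW1A 1AA", "spend": 42.50}
    record["customer_id"] = pseudonymise(record["customer_id"])
    print(record)  # the training record no longer carries the raw identifier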

Options to consider:

  • Use synthetic or anonymised information to train the AI model (a brief sketch follows this list).
  • Assess the retention of information used in both the training and inference stages of model development. 
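
A brief illustration of the synthetic-data option: fit simple summary statistics from a real column and sample artificial values with a similar distribution. Dedicated synthetic-data tooling would model joint distributions across columns; this sketch, on one hypothetical column, only shows the principle.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Hypothetical real training column: customer ages.
    real_ages = np.array([23, 35, 41, 29, 52, 47, 38, 61, 27, 44], dtype=float)

    # Sample synthetic values that preserve only the column's mean and spread,
    # so no real person's record appears in the training set.
    synthetic_ages = rng.normal(loc=real_ages.mean(),
                                scale=real_ages.std(),
                                size=real_ages.size)
    synthetic_ages = np.clip(synthetic_ages, 18, 100).round()
    print(synthetic_ages)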

 

Control measure: There is ongoing monitoring and testing of the use of information to ensure: 

  • only the minimum information required is being processed by the AI system;
  • any unnecessary duplicated information is identified and tracked (eg through automated data tracing); and 
  • any unnecessary duplicated information is then deleted, where necessary. 

Risk: If information is not assessed and then separated, there is a risk that excessive information will be processed and retained for longer than is necessary. 

Similarly, if unnecessary duplicate information is created, processed or stored in the system, there is a risk that the datasets as a whole are excessive. This may breach UK GDPR article 5(1)(c).

Ways to meet our expectations:

  • Standardise the checks required in a checklist or test plan which includes:
    • a check of the current features within the system;
    • a review of retention of information; and 
    • potential further minimisation of information used.
  • Assess whether: 
    • all the information is needed (eg would the whole address be needed, or would the postcode alone produce the same result); and
    • the same volume of information is required (or whether the same results can be achieved with less information).
  • Consider document 'cropping' or redaction for both collection and sharing purposes.
  • Map out all the processes, across the different phases of your AI system, in which personal information is used.
  • Ensure the mapping, and the subsequent assessment of potential further minimisation, includes information used in the production of the system. Then retrain the system as part of ongoing research.
  • Index the personal information used in each phase of the AI system lifecycle.
  • Implement automated data tracing to track the information processed across the whole system.
  • Have measures to detect any duplicated information present in different phases (from production to research) and delete it where necessary (see the duplicate-detection sketch after this list).
  • Implement audit trails that log and track the usage of information throughout the AI system. This includes recording access, modifications and processing activities. Regularly review and analyse these logs to identify any unnecessary processing (an audit-trail sketch follows this list).
  • Implement data masking or anonymisation techniques to replace or obscure personally identifiable information and other sensitive information. This ensures that only necessary information is visible to the AI system while preserving privacy (a masking sketch follows this list).
  • Integrate automated information validation checks to verify the integrity and quality of incoming information. This helps identify and filter out irrelevant or erroneous information, ensuring that only accurate and relevant information is processed.
  • Deploy continuous monitoring tools that assess the AI system's compliance with information use policies. These tools can trigger alerts or notifications when deviations from established processing norms are detected.
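
A minimal duplicate-detection sketch: fingerprint each record with a hash of its canonical serialisation, then flag fingerprints that appear in more than one phase. The store names and records are hypothetical; a real deployment would feed this from the automated data tracing described above.

    import hashlib
    import json
    from collections import defaultdict

    def record_fingerprint(record: dict) -> str:
        # Canonical JSON serialisation, so identical records hash identically
        # regardless of key order or which phase holds the copy.
        canonical = json.dumps(record, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Hypothetical stores holding copies of information in different phases.
    stores = {
        "training": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
        "research": [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}],
    }

    seen = defaultdict(list)
    for phase, records in stores.items():
        for rec in records:
            seen[record_fingerprint(rec)].append(phase)

    # Fingerprints present in more than one phase are flagged for review.
    duplicates = {fp: phases for fp, phases in seen.items() if len(phases) > 1}
    print(duplicates)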
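
A minimal audit-trail sketch using Python's standard logging module to record access, modification and deletion events as structured JSON. The actor and dataset names are hypothetical; a production system would write to append-only, centrally reviewed storage.

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    audit_log = logging.getLogger("ai_system.audit")

    def log_data_event(actor: str, action: str, dataset: str, detail: str = "") -> None:
        # Record who touched which dataset, when and how, as structured JSON
        # so the logs can be reviewed and analysed automatically.
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,  # eg "access", "modify", "delete"
            "dataset": dataset,
            "detail": detail,
        }
        audit_log.info(json.dumps(event))

    # Hypothetical usage during a training run:
    log_data_event("training-pipeline", "access", "customer_features_v3", "epoch 1 read")
    log_data_event("analyst_jsmith", "delete", "customer_features_v2", "superseded copy removed")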
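
And a minimal masking sketch that obscures a direct identifier and redacts email addresses from free text before records reach the AI system. The field names and pattern are illustrative; a real deployment would rely on a maintained catalogue of identifier patterns or a vetted masking library.

    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def mask_record(record: dict) -> dict:
        # Obscure the direct identifier and strip emails from free text, so
        # only the information the AI system actually needs stays visible.
        masked = dict(record)
        masked["name"] = masked["name"][0] + "***"
        masked["notes"] = EMAIL_RE.sub("[EMAIL REDACTED]", masked["notes"])
        return masked

    record = {"name": "Alice", "notes": "Contact alice@example.com re: claim", "claim_value": 1200}
    print(mask_record(record))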

Options to consider:

  • Use automated profiling tools to analyse datasets and identify patterns, outliers, and potentially sensitive information. These tools can help to understand the nature of the information and assist in determining which elements are essential for the AI system's operation (a minimal profiling sketch follows).
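
A minimal profiling sketch, assuming pandas is available: summarise each column and flag ones whose values match simple identifier-like patterns (an '@' for emails, a postcode-like token). Real profiling tools use far richer detection; the dataframe here is hypothetical.

    import pandas as pd

    # Hypothetical dataset awaiting assessment.
    df = pd.DataFrame({
        "postcode": ["SW1A 1AA", "M1 2AB", "EH1 3CD"],
        "email": ["a@example.com", "b@example.com", "c@example.com"],
        "spend": [10.0, 22.5, 7.8],
    })

    # Summarise each column and flag identifier-like content for human review.
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "unique_values": df.nunique(),
        "null_count": df.isna().sum(),
    })
    profile["possible_identifier"] = [
        df[col].astype(str).str.contains(r"@|\d[A-Z]{2}\b", regex=True).any()
        for col in df.columns
    ]
    print(profile)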

 

Control measure: There is a documented retention policy and schedule in place and evidence that the schedule is adhered to (ie personal information is deleted in line with the schedule, or retention outside the schedule is justified and approved).

Risk: Without documented and monitored retention schedules that are adhered to, there is a risk that personal information will be retained for longer than necessary and become inaccurate and excessive for the purposes it was collected for. This may breach UK GDPR article 5(1)(c).

Ways to meet our expectations:

  • Apply classification to personal information based on its sensitivity and set the corresponding retention periods for each category (a retention-check sketch follows this list).
  • Document the retention schedule based on business need, statutory requirements and other principles. 
  • Provide sufficient information in the schedule to allow you to identify all records and put disposal decisions into effect.
  • Standardise and document planned weeding activities and ensure they occur on an ongoing or regular basis (eg a process of rolling deletion of information).
  • Delete or destroy all personal information held within AI systems in line with the retention schedule. If it is not possible to permanently delete the information (due to system functionality restrictions), store it securely 'out of reach' and lock down access, or anonymise it.
  • Remove or erase training data that is no longer required (eg because it is out of date and no longer predictively useful).
  • Document the justification for any decision to keep personal information outside the retention period. 
  • Consider reproducibility (ie whether you will need to reproduce results at a later time, which may no longer be possible once the original information has been deleted).
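
A minimal sketch of a classification-driven retention check that supports rolling deletion. The categories and retention periods are hypothetical; yours should come from your documented schedule, business need and statutory requirements.

    from datetime import date, timedelta

    # Hypothetical categories and periods, mapped from the retention schedule.
    RETENTION_DAYS = {
        "special_category": 365,
        "standard_personal": 730,
        "pseudonymised": 1825,
    }

    def due_for_deletion(record: dict, today: date) -> bool:
        limit = timedelta(days=RETENTION_DAYS[record["category"]])
        return today - record["collected"] > limit

    records = [
        {"id": 1, "category": "standard_personal", "collected": date(2022, 1, 10)},
        {"id": 2, "category": "special_category", "collected": date(2024, 6, 1)},
    ]

    for rec in records:
        if due_for_deletion(rec, date.today()):
            # Delete, or where deletion is technically impossible, lock access
            # down or anonymise, and record the disposal decision.
            print(f"Record {rec['id']} exceeds its retention period: dispose or justify.")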

Options to consider:

  • Review the retention schedule regularly to make sure it is accurate and comprehensive.
  • Designate responsibility for retention and disposal to an appropriate person (this could be centrally or in each department, eg Information Asset Owners).