- What is federated learning and what does it do?
- How does FL assist with data protection compliance?
- What do we need to know about implementing federated learning?
- What are the risks associated with using FL?
Federated learning (FL) is a technique that allows multiple parties to train AI models on their own information ('local' models). The parties then combine some of the patterns that those models have identified (known as 'gradients') into a single, more accurate 'global' model, without having to share any training information with each other. FL has similarities with SMPC – for example, the processing involves multiple entities – but FL is not necessarily a type of SMPC.
There are two approaches to federated learning: centralised design and decentralised design.
In centralised FL, a co-ordination server creates a model or algorithm, and duplicate versions of that model are sent out to each distributed data source. The duplicate model trains itself on each local data source and sends back the analysis it generates. The co-ordination server synthesises that analysis with the analysis from the other data sources and integrates it into the central model. This process repeats to continually refine and improve the model. Unlike SMPC, centralised FL requires a trusted third party to operate the co-ordination server.
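The centralised round described above can be sketched in a few lines. This is a minimal toy, not any particular FL framework: the model is a single weight for y = w·x, the datasets are invented, and real systems exchange full parameter sets or gradients for much larger models.

```python
# Toy sketch of centralised FL (federated-averaging style).
# All names and data are illustrative assumptions.

def local_update(w, data, lr=0.1):
    """Train the duplicate model on one local data source (one gradient step)."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_round(w_global, local_datasets):
    """The co-ordination server sends the global model out, then averages the returned updates."""
    local_weights = [local_update(w_global, d) for d in local_datasets]
    return sum(local_weights) / len(local_weights)  # synthesise into the global model

# Three parties, each holding data roughly consistent with y = 2x.
# The raw (x, y) pairs never leave the party that holds them.
datasets = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.0, 2.1), (2.0, 3.9)],
]
w = 0.0
for _ in range(50):  # the process repeats to refine the model
    w = federated_round(w, datasets)
print(round(w, 2))
```

Only model parameters cross the network here; the server never observes a training example, which is the core data-minimisation property of FL.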
In decentralised FL, there is no central co-ordination server. The participating entities communicate directly with each other, and each can update the global model. This design has advantages: there is no single point of failure, and it avoids the security and fairness risks that can arise from relying on a single server.
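For contrast with the centralised sketch, a decentralised round might look as follows. Here every participant trains locally and then averages parameters with all of its peers; this all-to-all exchange is a simplifying assumption (real gossip protocols typically average with a subset of peers), and the model and data are invented.

```python
# Toy sketch of decentralised FL: no co-ordination server.
# Each participant trains locally, then averages directly with its peers.

def local_update(w, data, lr=0.1):
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def gossip_round(weights, datasets):
    """Every participant trains, then pulls its peers' parameters and averages them."""
    trained = [local_update(w, d) for w, d in zip(weights, datasets)]
    avg = sum(trained) / len(trained)
    return [avg] * len(trained)  # each peer updates the shared model directly

datasets = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(2.0, 4.2)]]
weights = [0.0, 0.0, 0.0]  # each participant starts from the same initial model
for _ in range(50):
    weights = gossip_round(weights, datasets)
```

No single node holds a privileged position, so there is no single point of failure; the trade-off is more peer-to-peer communication.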
FL can help with data protection compliance in several ways, including:
- minimising the personal information processed during a model’s training phase;
- providing an appropriate level of security (in combination with other PETs); and
- minimising the risk arising from data breaches, as the data is not held together in a central location that would be a more valuable target for an attacker.
FL can also reduce risk in some use cases, and combining it with other PETs further mitigates the risk of attackers extracting or inferring any personal information.
While decentralised FL can be cheaper than training a centralised model, it still incurs significant computational cost. This may make it unusable for large-scale processing operations. You should consider whether the training and testing time and memory usage are acceptable for your aims. These will depend on the scale of the processing and will increase proportionally as the size of the dataset increases.
You should also consider:
- the choice of encryption algorithm to encrypt local model parameters;
- the local model parameter settings to be specified when reporting the training or testing time and required memory; and
- analysing the FL algorithm to determine its resource usage, so that you can estimate the resource requirements.
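As a rough illustration of that last point, you can profile a single local update before scaling up. The sketch below uses Python's standard `time` and `tracemalloc` modules; the model and synthetic dataset are illustrative stand-ins for your own training step.

```python
# Profiling one local update to estimate time and memory requirements.
import time
import tracemalloc

def local_update(w, data, lr=0.1):
    """Stand-in for one local training step on a client's dataset."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

# Synthetic local dataset; substitute a sample of your real workload.
data = [(i / 100_000, 2.0 * i / 100_000) for i in range(100_000)]

tracemalloc.start()
t0 = time.perf_counter()
w = local_update(0.0, data)
elapsed = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"training time: {elapsed:.3f}s, peak traced memory: {peak / 1024:.0f} KiB")
```

Multiplying the per-round figures by the expected number of rounds and clients gives a first-order estimate of whether the processing is feasible at your scale.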
When you use FL techniques, local machine learning (ML) models can still contain personal information. For example, the models may preserve features and correlations from the training data samples that could then be extracted or inferred by attackers.
The information shared as part of FL may indirectly expose private information used for local training of the ML model. For example, by:
- model inversion of the model updates;
- observing the patterns that those models have identified (known as ‘gradients’); or
- other attacks such as membership inference.
The nature of FL means the training process is exposed to multiple parties. This can increase the risk of leakage by reverse engineering, if an attacker can:
- observe model changes over time;
- observe specific model updates (ie a single client update); or
- manipulate the model.
To protect the privacy of your training dataset and local model parameters that are exchanged with the co-ordination server, you should combine FL with other PETs. For example, you could use:
- SMPC to protect parameters sent from the clients to ensure that they do not reveal their inputs. For example, the Secure Aggregation protocol (a form of SMPC) has already been integrated into Google's TensorFlow Federated framework;
- homomorphic encryption to encrypt local model parameters from all participants. The co-ordination server receives an encrypted global model that can only be decrypted if a sufficient number of local models have been aggregated;
- differential privacy, to hide the participation of a user in a training task. If a model depends on the information of any particular person used to train it, this increases the risk of singling them out. You could use differential privacy to add noise and hide the fact that you used a particular person’s information in the training task. This makes it less certain which data points actually relate to a particular person. This is more effective if the number of people in the dataset is large; and
- secure communications protocols (eg TLS) between clients (in the decentralised model) and between clients and the server (in the centralised model) to prevent man-in-the-middle attacks, eavesdropping and tampering on the connection between the clients and co-ordination server.
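To illustrate the first of these options, the pairwise-masking idea behind secure aggregation can be sketched as follows. This is a toy: the real Secure Aggregation protocol derives masks via key agreement and handles client dropouts, whereas here the shared seeds, client values and modulus are all invented for illustration.

```python
# Toy sketch of pairwise masking for secure aggregation.
# Each pair of clients shares a random mask; one adds it, the other
# subtracts it, so every mask cancels in the sum. The server sees only
# masked updates individually, but recovers the exact aggregate.
import random

Q = 2**31 - 1  # public modulus; all arithmetic is done mod Q

def masked_updates(updates, pair_seeds):
    """Apply each pair's shared mask: client i adds it, client j subtracts it."""
    masked = list(updates)
    for (i, j), seed in pair_seeds.items():
        mask = random.Random(seed).randrange(Q)  # derived from the shared secret
        masked[i] = (masked[i] + mask) % Q
        masked[j] = (masked[j] - mask) % Q
    return masked

# Three clients; model updates are quantised to integers before masking.
updates = [5, 12, 20]
pair_seeds = {(0, 1): 101, (0, 2): 102, (1, 2): 103}  # illustrative shared secrets
masked = masked_updates(updates, pair_seeds)

# Individually, each masked value is uninformative to the server...
total = sum(masked) % Q  # ...but the masks cancel, leaving the true aggregate
print(total)  # prints 37, i.e. sum(updates)
```

Because the server only ever learns the aggregate, a single client's update (the 'single client update' risk above) is never exposed in the clear.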
Further reading – ICO guidance
See the ‘How should we assess security and data minimisation in AI’ section of our guidance on AI and data protection for further information on security risks posed by AI systems and available mitigation techniques.