Private set intersection (PSI)

What is private set intersection (PSI) and what does it do?
How does PSI assist with data protection compliance?
What are the risks associated with using PSI?

What is private set intersection (PSI) and what does it do?

PSI is a specific type of secure multiparty computation (SMPC) that allows two parties, each with their own dataset, to find the “intersection” between them (ie the elements the two datasets have in common), without revealing or sharing those datasets. You could also use it to compute the size of the intersection or aggregate statistics on it.

The most common type of PSI is the client-server subtype, where only the client learns the PSI result. The client can be the user of a PSI service or the party who will learn the intersection or intersection size (the number of matching data points between the two parties), depending on the purposes. The server hosts the PSI service and holds information that the client can query to determine if it holds any matching information with the server.

PSI can work in two ways:

the parties interact directly with each other and need to have a copy of their set at the time of the computation, known as “traditional PSI”; or
the computation of PSI or the storage of sets can be delegated to a third-party server, known as “delegated PSI”.

The most efficient PSI protocols are highly scalable and use a variety of methods, including other privacy enhancing techniques, such as hashing or homomorphic encryption.
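As an illustration of how hashing can be combined with a second blinding step in a client-server PSI, the following is a minimal sketch under stated assumptions, not a production protocol. It uses Diffie-Hellman style double blinding, which is only one of several possible designs. The modulus, the function names and the sample identifiers are all assumptions made for illustration. The toy modulus is not suitable for real use, and both parties are simulated in one process.

```python
import hashlib
import math
import secrets

# Toy modulus (assumption): the Mersenne prime 2**127 - 1. Illustration only;
# a real deployment would use a vetted group from a cryptographic library.
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Map an identifier to a non-zero value modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a random exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, secret):
    """Raise each value to the secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]


def client_server_psi(client_items, server_items):
    """Return the client's items that the server also holds."""
    a, b = new_secret(), new_secret()  # client secret, server secret
    # 1. The client hashes and blinds its items, then sends them to the server.
    client_blinded = blind([hash_to_group(x) for x in client_items], a)
    # 2. The server blinds the client's values again (keeping their order) and
    #    sends back its own blinded items. Sorting stands in for a shuffle.
    double_blinded = blind(client_blinded, b)
    server_blinded = sorted(blind([hash_to_group(y) for y in server_items], b))
    # 3. The client applies its own secret to the server's values and compares.
    server_double = set(blind(server_blinded, a))
    return [x for x, d in zip(client_items, double_blinded) if d in server_double]
```

In this sketch only the client learns which of its own items match, which corresponds to the client-server subtype described above. Taking the length of the returned list gives the intersection size.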

How does PSI assist with data protection compliance?

PSI can help to achieve data minimisation as no information is shared beyond what each party has in common.

PSI offers the same benefits as other SMPC protocols, such as:

no single party being able to have a ‘global view’ of all combined identifiable input data from both parties;
the parties involved in each stage of the processing receiving the minimum amount of information tailored to their requirements, preventing purpose creep; and
the ability, given cryptographic expertise, to modify PSI protocols so that they show only anonymous aggregate statistics from the intersection, depending on the requirements of the sharing.

Example – Using private set intersection

Two health organisations process personal information about people’s health.

Organisation A processes information about people’s vaccination status, while Organisation B processes information about people’s specific health conditions.

Organisation B needs to determine the percentage of people with underlying health conditions who have not been vaccinated.

Ordinarily, this may require Organisation A to disclose its entire dataset to Organisation B so the latter can compare with its own. By using PSI, it does not need to do so. In fact, both organisations can minimise the amount of personal information they process, while still achieving their purposes.

A third party provides the PSI protocol. The computation involves processing the personal information that both organisations hold. However, the output of that computation is the number of people with underlying health conditions who are not vaccinated. Organisation B therefore learns only this and does not otherwise process Organisation A’s dataset directly.

This minimises the personal information needed to achieve the purpose. Therefore, it enhances people’s privacy.
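To show how Organisation B could turn the PSI output into the percentage it needs, here is a short sketch. The function name and the figures in the usage note are hypothetical. It assumes Organisation B already knows how many people with underlying health conditions are in its own dataset.

```python
def percentage_unvaccinated(cohort_size: int, unvaccinated_count: int) -> float:
    """Percentage of Organisation B's cohort that the PSI output reports as unvaccinated.

    cohort_size: number of people with underlying conditions held by Organisation B.
    unvaccinated_count: the single number output by the PSI computation.
    """
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    if not 0 <= unvaccinated_count <= cohort_size:
        raise ValueError("unvaccinated_count must lie between 0 and cohort_size")
    return 100 * unvaccinated_count / cohort_size
```

For example, with a hypothetical cohort of 2,000 people and a PSI output of 300, percentage_unvaccinated(2000, 300) gives 15.0.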

What are the risks associated with using PSI?

PSI introduces some risks that you must mitigate. These include:

risks of re-identification from inappropriate intersection size or over-analysis;
the introduction of a third party to the processing, which may increase data protection risks to people if that third party is compromised; and
the potential for one or more of the parties to use fictional data in an attempt to reveal information about people.

You should consider whether you can incorporate differential privacy into the PSI protocol to reduce the risk of people being singled out from the output, provided the output remains sufficiently useful to fulfil your purposes. This approach is generally less error-prone than trying to manually constrain protocol parameters to rule out specific attacks.
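One way to picture differential privacy applied to a PSI output is to add calibrated noise to the intersection size before it is released. The sketch below assumes the Laplace mechanism with a sensitivity of 1 (adding or removing one person changes the count by at most 1). The function name and the epsilon parameter are illustrative assumptions. Python's general-purpose random module is used for readability, whereas a real deployment would rely on a vetted differential privacy library and would add the noise inside the protected computation.

```python
import random


def noisy_intersection_size(true_size: int, epsilon: float, rng=None) -> float:
    """Return the intersection size plus Laplace noise of scale 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    # The difference of two independent exponential samples with rate epsilon
    # follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_size + noise
```

A smaller epsilon gives stronger protection but a noisier result, so you would need to check that the output is still useful for your purposes.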

You should choose an appropriate intersection size. This is because a low intersection size may allow the party computing the intersection to single out people within that intersection in cases where a person’s record has additional information associated with it (eg numerical values for hospital visits). These values can be added together and used for publishing aggregates (known as the intersection sum).

If an identifier has a unique associated value that is very large compared with the values of all other identifiers, it may be easy to detect whether that identifier was in the intersection by looking at the intersection sum. In that case, a large intersection sum makes it possible to infer that the identifier was in the intersection.
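The following sketch illustrates this inference with hypothetical identifiers and values. It assumes that all associated values are non-negative and that the party making the inference knows the values attached to its own identifiers.

```python
# Hypothetical associated values (eg number of hospital visits) per identifier.
visits = {"id-001": 3, "id-002": 2, "id-003": 4, "id-004": 250}


def outlier_in_intersection(values: dict, outlier: str, intersection_sum: int) -> bool:
    """Infer from the intersection sum alone whether the outlier was matched."""
    # Largest sum that is possible if the outlier is NOT in the intersection.
    max_without_outlier = sum(v for k, v in values.items() if k != outlier)
    # A sum above that bound can only be reached if the outlier was included.
    return intersection_sum > max_without_outlier
```

Here the other three values add up to 9 at most, so an intersection sum of 255 shows that "id-004" was in the intersection, while a sum of 5 shows that it was not.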

The intersection sum may also reveal which identifiers are in the intersection, if the intersection is too small. This could make it easier to guess which combination of identifiers could be in the intersection in order to obtain a particular intersection sum. You should therefore decide on an appropriate “threshold” for intersection size and remove any outliers to mitigate this risk.
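As a rough illustration of why small intersections are risky, the sketch below enumerates every combination of identifiers that could produce an observed intersection sum. The function name and the data in the usage note are hypothetical. It assumes the party knows the intersection size as well as the sum.

```python
from itertools import combinations


def candidate_intersections(values: dict, size: int, intersection_sum: int):
    """List every set of identifiers of the given size with the observed sum."""
    return [
        set(combo)
        for combo in combinations(sorted(values), size)
        if sum(values[k] for k in combo) == intersection_sum
    ]
```

With the hypothetical values {"a": 5, "b": 7, "c": 11, "d": 20}, an intersection of size 2 and a sum of 18, the only candidate is {"b", "c"}, so the sum reveals exactly who matched. Larger intersections and less distinctive values generally leave many more candidates.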

Once you agree a threshold for intersection size, you could set the computation process to automatically terminate the PSI protocol if it is likely to result in a number below this. Additionally, hiding the size of the intersection, as well as the size of the inputs, could provide additional mitigations.
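A threshold check of this kind could look like the sketch below. The exception and function names are hypothetical. It assumes the check runs inside the protected computation, so that a result below the threshold is never shown to either party, and it deliberately leaves the actual size out of the error message.

```python
class IntersectionTooSmall(Exception):
    """Raised when releasing the result would breach the agreed threshold."""


def release_result(intersection_size: int, threshold: int) -> int:
    """Release the intersection size only if it meets the agreed threshold."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if intersection_size < threshold:
        # Do not include the real size in the message, as that would leak it.
        raise IntersectionTooSmall("result withheld: below the agreed threshold")
    return intersection_size
```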

Re-identification can also happen due to over-analysis. This involves performing multiple intersection operations that may either reveal or remove particular people from the intersection. In other words, this can lead to re-identification through singling out. Rate-limiting can be an effective way of mitigating this risk. You should define this type of technical measure in any data sharing agreement.
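To make the over-analysis risk and the rate-limiting mitigation concrete, here is a minimal sketch. The class, the function names and the identifiers are hypothetical, and a plain set operation stands in for a PSI run that outputs only the intersection size.

```python
class QueryBudget:
    """Allow each counterparty at most max_queries PSI runs."""

    def __init__(self, max_queries: int):
        self.max_queries = max_queries
        self.used = {}

    def authorise(self, party: str) -> bool:
        count = self.used.get(party, 0)
        if count >= self.max_queries:
            return False
        self.used[party] = count + 1
        return True


def intersection_size(client_set, server_set) -> int:
    # Stand-in for a PSI run that reveals only the intersection size.
    return len(set(client_set) & set(server_set))


def limited_query(budget: QueryBudget, party: str, client_set, server_set) -> int:
    """Run the size-only query if the party still has budget left."""
    if not budget.authorise(party):
        raise PermissionError("query budget exhausted for " + party)
    return intersection_size(client_set, server_set)
```

If a party runs one query with {"p1", "p2"} and another with {"p1", "p2", "p3"}, the difference between the two sizes shows whether "p3" is in the other party's set. A budget of one query per agreed period would refuse the second run. The appropriate limit depends on your circumstances and is the kind of measure you could record in the data sharing agreement.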

Some PSI implementations may not ensure input is checked (ie that parties use real input data as opposed to non-genuine or fictional information). Others may not prevent parties from arbitrarily changing their input after the computation process begins.

This is an issue because it allows a malicious party to learn whether the other party holds information that the malicious party does not genuinely hold itself. If this is personal information, there is a risk that the malicious party could access sensitive information, which may have detrimental effects on people.

You could mitigate this risk by ensuring that the inputs are checked and validated, and independently audited.
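One possible way to support this kind of audit is for each party to commit to its input before the computation begins, so that an independent auditor can later confirm the input was not changed. The sketch below uses a salted hash commitment. This is an assumption about how such a check might be built rather than a feature of any particular PSI implementation, and the function names are hypothetical. It does not prove that the inputs are genuine, so you would still need to check this separately.

```python
import hashlib
import secrets


def commit_to_inputs(items, nonce: bytes = None):
    """Return (commitment, nonce) binding a party to its input set."""
    if nonce is None:
        nonce = secrets.token_bytes(32)
    h = hashlib.sha256(nonce)
    for item in sorted(set(items)):
        data = item.encode("utf-8")
        # Length-prefix each item so that item boundaries are unambiguous.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest(), nonce


def verify_inputs(items, nonce: bytes, commitment: str) -> bool:
    """Check that the disclosed items and nonce match the earlier commitment."""
    recomputed, _ = commit_to_inputs(items, nonce)
    return secrets.compare_digest(recomputed, commitment)
```

The party would share only the commitment before the PSI run and disclose the items and nonce to the auditor afterwards.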

If you and other organisations use PSI to match people from your separate databases, you must also maintain referential integrity to ensure each record is matched accurately. Linking across datasets becomes more difficult if the information is held in a variety of formats. There may be a risk that some people are not included, or are included by mistake. It is possible to reduce the risk of inaccurate matching using a number of techniques, including tokenisation and hashing. For example, if a common identifier is hashed by both parties, then the hashes will only match if the information is an exact match for both parties.
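Because hashes only match on exact input, both parties would need to put identifiers into the same format before hashing. The sketch below shows one way to do this with a keyed hash. The normalisation rules, the shared key and the sample identifier are assumptions for illustration, and you would need to agree your own rules with the other party.

```python
import hashlib
import hmac
import unicodedata


def normalise(identifier: str) -> str:
    """Put an identifier into one agreed format before hashing."""
    text = unicodedata.normalize("NFKC", identifier)
    # Remove all whitespace and use upper case throughout.
    return "".join(text.split()).upper()


def token(identifier: str, shared_key: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of the normalised identifier."""
    message = normalise(identifier).encode("utf-8")
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()
```

With these rules, "ab 12 34 56 c" and "AB123456C" produce the same token, whereas hashing the two raw strings directly would produce different values and the record would be missed.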

Whether you use a third party will depend on whether doing so is likely to reduce the risk compared with using a two-party protocol. When performing a DPIA, you should document the risks and choose the most suitable option for mitigating the risks for your circumstances. If you are using a third party for the computation or storage of sets, you must ensure appropriate technical and organisational measures are in place to mitigate the risk of any personal information being compromised.