Effective anonymisation of qualitative data
In research, the anonymisation of qualitative data allows the data to be shared (through publications and data sharing services) while preserving the privacy of the research participants.
GDPR legislation
All research data relating to human participants in the UK is subject to the terms of the General Data Protection Regulation (GDPR).
Under the terms of GDPR, researchers have a legal duty to safeguard personal data, that is, any data that makes a person identifiable. This includes data about an individual's ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, health, sex life or orientation, genetic and biometric data, and online identifiers.
GDPR legislation requires researchers to safeguard data effectively; safeguards include technical and organisational barriers to access, such as encryption, authentication requirements and user licences, or anonymisation that would 'no longer permit the identification of data subjects'.
GDPR legislation ceases to apply once a dataset has been effectively anonymised. What counts as 'anonymised' is measured by a 'likely reasonably' test. This means that the data is not personal if, on the balance of probabilities, third parties cross-referencing the 'anonymised' data with information or knowledge already available to the public cannot identify individuals.
GDPR legislation does allow personal data to be revealed where explicit consent has been given.
Effective anonymisation
Our ethical duties as researchers, to protect our participants, extend beyond the legal duties imposed by the GDPR legislation.
It is recognised that the process of anonymisation changes the data and there is often a conflict between maintaining the integrity of the dataset (to prevent important details being missed or incorrect inferences being made) and maintaining the anonymity of participants.
A person's identity can be disclosed from:
- direct identifiers such as names, postcode information or pictures
- indirect identifiers which, when linked with other available information, could identify someone
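One way to get a rough sense of the indirect-identifier risk described above is to count how many participants share each combination of attributes. The sketch below is illustrative only: the function name `rare_combinations`, the field names and the threshold `k` are all assumptions made for this example, and the records are invented. A combination held by very few participants is a candidate for generalising or removing before sharing.

```python
from collections import Counter


def rare_combinations(records, quasi_identifiers, k=2):
    """Return attribute combinations shared by fewer than k records.

    `quasi_identifiers` is the list of fields treated as indirect
    identifiers; which fields qualify is a judgement for the researcher.
    """
    combos = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: count for combo, count in combos.items() if count < k}


# Hypothetical participant attributes (invented, not real data).
participants = [
    {"role": "nurse", "age_band": "30-39", "site": "A"},
    {"role": "nurse", "age_band": "30-39", "site": "A"},
    {"role": "team leader", "age_band": "50-59", "site": "A"},
]

flagged = rare_combinations(participants, ["role", "age_band", "site"], k=2)
print(flagged)
```

A check like this only covers the attributes you list. It cannot account for what people who know the participants may already know, so it supports rather than replaces the researcher's judgement.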
Anonymising research data is best planned early in the research and should be considered as part of the process for obtaining informed consent, which should include explicit consent for planned data sharing or imposing any necessary access restrictions.
Personal data should never be disclosed from research information unless the participant has consented to the disclosure, in writing (assuming consents are collected in writing).
Researchers should think about the level of anonymity they need to achieve with their data in order to maintain the integrity of the dataset and yet preserve the privacy of their participants. Knowing which data you wish to collect will help you to create an effective anonymisation strategy which is consistent across your dataset and generates ethical data which can be reused without contravening data protection laws. For example, does administrative data (names and addresses for example) need to be collected and if it does, can it be separated from the research data and destroyed at an early stage of the research process? Thinking in advance about what may or may not be recorded/transcribed can be a much more effective way of creating data that accurately represent the research process and the contribution of participants.
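The idea of separating administrative data from research data can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed workflow: the function name `split_admin_from_research`, the field names and the `P-` identifier format are hypothetical. Each record is split into two rows that share only a random participant ID, so the administrative table can be stored separately and destroyed early.

```python
import secrets


def split_admin_from_research(raw_records, admin_fields):
    """Split each record into an administrative row and a research row.

    The two rows share a random participant_id and nothing else, so the
    administrative table can be held separately and deleted when no
    longer needed.
    """
    admin_table, research_table = [], []
    for record in raw_records:
        # Random ID with no relationship to the participant's details.
        participant_id = "P-" + secrets.token_hex(8)
        admin_row = {"participant_id": participant_id}
        research_row = {"participant_id": participant_id}
        for field, value in record.items():
            target = admin_row if field in admin_fields else research_row
            target[field] = value
        admin_table.append(admin_row)
        research_table.append(research_row)
    return admin_table, research_table


# Hypothetical record (invented, not real data).
raw = [
    {
        "name": "Example Person",
        "address": "1 Example Street",
        "transcript": "Interview text goes here.",
    }
]

admin, research = split_admin_from_research(raw, admin_fields={"name", "address"})
print(sorted(research[0]))
```

Once the administrative table is destroyed, the participant ID can no longer be traced back to a name or address, although the research rows may still contain indirect identifiers that need separate attention.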
Researchers should also note that different levels of anonymisation may be required for different forms of dissemination. For example, what is written up in a journal paper (which is public but reaches a fairly limited readership) may be different to what you would include on a blog to engage the public, which in turn may be different to how you make the data available for sharing in a thesis (where you can impose access and licence restrictions).
Anonymisation challenges
The process of anonymisation is complex, and far from water-tight. In many cases changing people’s names or disguising locations can be the first steps in a more nuanced process around managing 'identifying details'. Anonymity is a continuum (from fully anonymous to very nearly identifiable) and the researcher needs to find a sensible balance between the risk of identification (including through indirect identification) and the needs of the research.
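The first step mentioned above, changing names and disguising locations, can be partly automated for transcripts. The sketch below is an assumption-laden illustration: the function name `pseudonymise`, the replacement table and the sample sentence are invented for this example. It replaces each listed string with a fixed substitute so the same name always maps to the same pseudonym.

```python
import re


def pseudonymise(text, replacements):
    """Replace each listed name or place with its fixed substitute.

    Longer keys are matched first, so a full name is replaced before a
    first name that it contains.
    """
    if not replacements:
        return text
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], text)


# Hypothetical replacement table and sentence (invented, not real data).
replacements = {
    "Mary Jones": "[Participant A]",
    "Mary": "[Participant A]",
    "Leeds": "[northern city]",
}

sample = "Mary Jones met Mary in Leeds."
cleaned = pseudonymise(sample, replacements)
print(cleaned)
```

A substitution like this only catches the exact strings listed. Nicknames, misspellings and identifying details that are not names (a job title, a distinctive event) still need a manual read-through.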
Where data has not been fully anonymised (including against indirect identification), explicit consent needs to be obtained before this data can be put in the public domain. If this consent is not obtained at the point of data collection, it must be obtained before the data is put into the public domain.
One of the most difficult aspects of anonymising qualitative data relates to indirect identification of participants by people who are known to them, especially where the research involves (for example) interviewing a small group where all the participants are known to each other. These issues can be exacerbated if there are power relationships at play within this small group (for example a team and team leader being interviewed about their working lives).
In these cases, the researcher needs to be honest with the participants about the risk of disclosing their identities, and respectful of their participants' wishes. Where possible, processes such as smoke screens (e.g. where multiple pseudonyms are used for the same participant, direct quotations are re-phrased, and/or links between participants who are known to each other are broken) should be used to help protect against indirect identification. These approaches can often be used in publication without damaging the underlying dataset (which can be shared under restrictions).
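One element of the smoke-screen approach, using multiple pseudonyms for the same participant, can be sketched as below. The function name `apply_smoke_screen`, the data layout and the pseudonyms are assumptions made for illustration. Each participant is given a small set of pseudonyms, which are rotated across their quotations so that readers cannot easily link all of one person's quotes together.

```python
from itertools import cycle


def apply_smoke_screen(quotes, pseudonyms_by_participant):
    """Attribute each quote to the next pseudonym in that participant's set.

    The output drops the participant ID, so it holds only the rotating
    pseudonym and the quote text.
    """
    rotations = {
        participant_id: cycle(names)
        for participant_id, names in pseudonyms_by_participant.items()
    }
    return [
        {"speaker": next(rotations[quote["participant_id"]]), "text": quote["text"]}
        for quote in quotes
    ]


# Hypothetical quotes and pseudonym sets (invented, not real data).
quotes = [
    {"participant_id": "P1", "text": "First comment."},
    {"participant_id": "P2", "text": "Second comment."},
    {"participant_id": "P1", "text": "Third comment."},
    {"participant_id": "P1", "text": "Fourth comment."},
]
pseudonyms = {"P1": ["Alex", "Sam"], "P2": ["Jo"]}

published = apply_smoke_screen(quotes, pseudonyms)
print([item["speaker"] for item in published])
```

This covers only the pseudonym rotation. Re-phrasing direct quotations and breaking links between participants remain manual, judgement-based steps, and the mapping from participant to pseudonyms should be kept with the restricted dataset rather than the published material.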
However, there will be cases where this is not a feasible approach and the researcher needs to think carefully about how to obtain consent to potentially disclose an individual participant's identity without inadvertently disclosing the identities of other participants as part of this process.
The outcomes of this process should be written up in the data management plan and/or the ethics application for the project as appropriate.