In my last blog series, I discussed Privacy by Design (PbD) and its effect on the new European General Data Protection Regulation, or GDPR.
This blog post expands on that topic by explaining the first step in data anonymization, which is to find personally identifiable information within the data you work with.
That blog series on Privacy by Design (PbD) and the GDPR has inspired me to take a more practical look at applying PbD.
I’ve already discussed the tool Mockaroo in my blog on Privacy Enhancing Technology, but I want to go further. Is it possible to see which data is Personally Identifiable Information (PII)?
Data privacy has become a central concern in DevOps.
New privacy legislation like the GDPR has shaken up the DevOps world regarding data processing and storage. The GDPR bluntly states that data which is not necessary for the organization collecting it (the 'data controller' in GDPR terms) may NOT be processed and has to be eliminated immediately.
That is already a big burden for the data controller, but what about processing the data it does need? If that data contains personally identifiable information (PII), it may not be processed as-is, so pre-processing has to take place.
One of these possible pre-processing steps is data anonymization.
Googling ‘data anonymization,’ I found an OWASP page discussing this process, with the mention of some appropriate tools.
So, now we have a set of tools for data anonymization. But let’s define what data anonymization actually means.
According to OWASP, data anonymization consists of techniques for data processing, and procedures for handling the data, algorithms, keys, and lifecycle of the data.
Examples of such techniques are:
- Replacement – substitute identifying numbers
- Suppression – omit from the released data, partially or fully
- Generalization – replace value attribute with something less specific
- Perturbation – make random changes to the data
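As a rough sketch, the four techniques could look like this when applied to a single record. The record and field names are my own illustration, not from OWASP:

```python
import random

record = {"name": "Alice Carter", "ssn": "123-45-6789",
          "age": 34, "zip": "90210", "weight_kg": 61.2}

# Replacement: substitute the identifying number with a surrogate
record["ssn"] = "ID-0001"

# Suppression: omit the name from the released data
record["name"] = "*"

# Generalization: replace the exact age with a less specific range
record["age"] = "30-39"

# Perturbation: add small random noise to a numeric value
record["weight_kg"] = round(record["weight_kg"] + random.uniform(-2, 2), 1)

print(record)
```

In practice, which technique fits which attribute depends on how much analytical value you need to preserve in the released data.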
One of the tools the OWASP site mentions is ARX, so let's take a look at that.
ARX is open source software for anonymizing sensitive personal data so that no personal identification can take place.
The following illustration shows how ARX works:
It goes beyond the scope of this blog post to explain the whole process. Regarding data privacy, I will focus on the privacy model.
If you want more information about ARX, the developers have created a great video on its functionality.
The privacy model: k-anonymity
One of the privacy models ARX uses is k-anonymity.
A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.
I will explain this with an example.
The following table is a non-anonymized database consisting of patient records from a fictitious hospital in the United States.
There are six attributes and ten records in this data.
* This is a modified table, based on this k-anonymity example.
To achieve k-anonymity, we can apply suppression and generalization.
For this dataset, we will use generalization by categorizing the attribute Age.
Next, suppression is applied to the columns ‘Name’ and ‘Housing’ by replacing all values with ‘*’.
When viewing this table, do you notice something?
Take a good look at the attributes Age, Gender and State. For any combination of these attributes found in any row of the table, there are always at least two rows with those exact values (2-anonymity).
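A minimal way to check this property programmatically is to group the rows by their quasi-identifier values and verify that every group contains at least k rows. This is my own sketch; the rows below only mimic the generalized hospital table:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the dataset is k-anonymous:
    the size of the smallest group of rows sharing the same
    quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Illustrative rows in the spirit of the generalized table above
rows = [
    {"Age": "20-29", "Gender": "F", "State": "NY"},
    {"Age": "20-29", "Gender": "F", "State": "NY"},
    {"Age": "30-39", "Gender": "M", "State": "CA"},
    {"Age": "30-39", "Gender": "M", "State": "CA"},
]

print(k_anonymity(rows, ["Age", "Gender", "State"]))  # prints 2: the data is 2-anonymous
```

If any combination of Age, Gender and State appeared in only one row, the function would return 1, meaning that row's subject could be singled out.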
Have a look at the previous table, which reveals the information for Johnson and Ryan, or Joan and Petra. Attributes like these are called quasi-identifiers, and they are very important in data privacy: with their help, someone can identify a person based on these attributes alone.
This is just a small dataset, and you can use k-anonymity to find quasi-identifiers quite easily.
But what will you do when you have to analyze a much larger set with more than 100 rows?
In this case, you can use a tool like the already mentioned ARX. Be careful with high-dimensional data, though: with many attributes per individual, finding suitable quasi-identifiers becomes very hard.
Data anonymization is an important part of applying Privacy by Design in DevOps: pre-processing the data so that no Personally Identifiable Information (PII) can be found in the dataset.
K-anonymity is a method used in data anonymization to find quasi-identifiers, a set of attributes someone can use to identify a person in a dataset. In this blog post, I illustrated this with some fictional hospital data.
With a tool like ARX, you can apply k-anonymity to datasets, but be careful with high-dimensional data.