Open Data, Open Citizens?

Open data initiatives emphasize transparency.  But government data often includes personal information.  What happens when government open data initiatives clash with privacy?  And are efforts to scrub open data of personal information sufficient to address privacy concerns?  In this project, CIPPIC investigates the potential conflict between open data and privacy.

CIPPIC Report:  Open Data, Open Citizens?

CIPPIC Podcast:  Open Data and Privacy

CIPPIC FAQ:  Open Data and Privacy

 

Summary of Project

Open Data

The Open Knowledge Foundation defines “Open Data” as data that can be freely used, shared and built on by anyone, anywhere, for any purpose.  The Canadian government refers to Open Data as data that is freely accessible to the public and published in a manner that anyone can use and manipulate without restriction. For data to be open, it must: (a) be available in whole and easily accessible (i.e., downloadable or easily requested); (b) be reusable and redistributable; and (c) be universal (i.e., anyone can access, reuse, or redistribute it).  Generally, to be considered open data, a dataset must be released in a machine-readable format under an open, permissive license, with the intention that the data be used, combined, re-released and built upon by others.

Open Data and Privacy

Governments in Canada collect an enormous amount of personal information about individuals.  This information is often collected in datasets that could be released as open data.  However, governments are obliged by privacy laws to avoid disclosing personal information except for authorized purposes.

Canada’s strategies to maintain privacy are not explicitly described in the federal government’s open data policies. No new legislative framework has been created to specifically guide the move towards open government in Canada.

Anonymization is the act of destroying all links between de-identified datasets (datasets in which people’s identities are removed to prevent the data from being used to identify an individual) and original datasets. Data are labelled as anonymous once all personally identifiable information (PII) is removed. The term is used to imply that the data can no longer be re-identified.

Anonymization strategies include: replacing data with other variables (for example, substituting numbers for names); suppressing or omitting data from a set; generalizing data (specific dates become general years); or perturbing the data by making random changes. Each treatment offers a different level of protection against attackers.  Data anonymization is not foolproof.  There are many examples of attackers defeating anonymization efforts.
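The four treatments listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records are fictional, the field names (name, birth_date, postal_code, income) are hypothetical, and the size of the random change is arbitrary. It is not a description of any government's actual anonymization process, and, as noted above, none of these treatments guarantees anonymity.

```python
import random

# Fictional records; the field names are illustrative assumptions.
records = [
    {"name": "A. Example", "birth_date": "1984-03-12",
     "postal_code": "K1N 6N5", "income": 52000},
    {"name": "B. Sample", "birth_date": "1990-11-02",
     "postal_code": "M5V 2T6", "income": 61000},
]


def build_pseudonyms(names):
    """Replacement: map each name to an arbitrary number."""
    return {name: number for number, name in enumerate(names, start=1)}


def anonymize(record, pseudonyms, rng, spread=1000):
    """Apply the four treatments to one record and return a new record."""
    return {
        # Replacement: a number stands in for the name.
        "person_id": pseudonyms[record["name"]],
        # Generalization: a full birth date becomes a birth year.
        "birth_year": int(record["birth_date"][:4]),
        # Suppression: postal_code is simply not copied to the output.
        # Perturbation: a small random change is added to the income.
        "income": record["income"] + rng.randint(-spread, spread),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pseudonyms = build_pseudonyms([r["name"] for r in records])
released = [anonymize(r, pseudonyms, rng) for r in records]

for row in released:
    print(row)
```

Each treatment removes detail from the data, which is why stronger protection generally comes at the cost of a less useful dataset.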

Privacy Risks of Open Data

There is a tension between utility and anonymity:  data can often be either useful or anonymous, but rarely both. Therefore, since people expect open data to be useful, there are many privacy risks. First, when combined with other datasets, anonymous data can be re-identified. Second, personal data can be directly released in “anonymized” datasets, though often accidentally. Finally, third parties can use their own data to re-identify anonymous open data.

Re-identification occurs when individuals are identified from information gleaned from a supposedly anonymized data set.  Re-identification is successful when an attacker is able to find hidden PII in a data set, or when two or more data sets are combined to identify people. The latter involves combining anonymous data sets with non-anonymous data sets (for example, public voter lists) to re-identify individuals present in the anonymous data.
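The combination of data sets described above can be sketched as a simple matching exercise. The example below is a hypothetical illustration only: both data sets are fictional, and the choice of matching fields (birth year, postal prefix, sex) is an assumption about what an "anonymized" release and a public list might have in common. A person is treated as re-identified when exactly one released record and exactly one public record share the same combination of values.

```python
from collections import Counter, defaultdict

# Fictional "anonymized" release: names removed, other fields kept.
released = [
    {"birth_year": 1984, "postal_prefix": "K1N", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1990, "postal_prefix": "M5V", "sex": "M", "diagnosis": "diabetes"},
    {"birth_year": 1990, "postal_prefix": "M5V", "sex": "M", "diagnosis": "influenza"},
]

# Fictional public list that still carries names.
public_list = [
    {"name": "A. Example", "birth_year": 1984, "postal_prefix": "K1N", "sex": "F"},
    {"name": "B. Sample", "birth_year": 1990, "postal_prefix": "M5V", "sex": "M"},
    {"name": "C. Sample", "birth_year": 1990, "postal_prefix": "M5V", "sex": "M"},
]

# Fields assumed to appear in both data sets.
SHARED_FIELDS = ("birth_year", "postal_prefix", "sex")


def key(row):
    """Return the combination of shared field values for one row."""
    return tuple(row[field] for field in SHARED_FIELDS)


def link(released_rows, public_rows):
    """Return (name, diagnosis) pairs where the match is unique on both sides."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[key(person)].append(person["name"])
    released_counts = Counter(key(row) for row in released_rows)

    matches = []
    for row in released_rows:
        candidates = names_by_key.get(key(row), [])
        if released_counts[key(row)] == 1 and len(candidates) == 1:
            matches.append((candidates[0], row["diagnosis"]))
    return matches


reidentified = link(released, public_list)
print(reidentified)  # [('A. Example', 'asthma')]
```

In this sketch the two records that share the same birth year, postal prefix and sex cannot be told apart, which shows why generalizing data so that more people share each combination of values lowers the risk of this kind of matching.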

Conclusion

The privacy risks associated with open data are real, but should not be overstated.  There are real benefits to governments pursuing open data strategies.  The solution to this conflict between transparency and privacy lies in putting in place processes that recognize the real privacy risks and harms associated with open data and take appropriate steps to address them.  We suggest these steps may be broken down into “pre-release” and “release” stages.

Pre-release strategies include:

1.    Data minimization:  governments should collect as little personal information as is actually required to achieve the objective of the collection.

2.    Adopting processes that include a privacy impact assessment of every data set destined for release as open data.  These assessments should consider both the likelihood of data being associated with individuals and the potential impact of that association.  The more sensitive the data, the more care should be taken.

3.    Standardizing these assessments in checklists that are formalized across the board as part of the open data release process.

Release strategies include:

1.    Adopt state-of-the-art anonymization strategies that evolve over time as the field matures.  In our view, this may be best delivered as a centralized service, since maintaining consistent standards of anonymization across all departments may prove difficult.

2.    Where sensitive privacy values may be implicated by re-identification techniques, consider excluding the data set from open data publication.  Where open government values compel the release of the data, the risk of privacy harm may suggest publication in a format that does not lend itself so readily to re-identification.  Formats such as paper or, if digital, PDF do not lend themselves so readily to reuse.

This project was made possible through a grant provided by the Contributions Program of the Office of the Privacy Commissioner of Canada.