Waldo, Jim (2017)
Types: Conference object
Subjects: privacy, big data, k-anonymity
Big Data Science, which combines large data sets with techniques from statistics and machine learning, is beginning to reach the social sciences. The promise of this approach to investigation is considerable, allowing researchers to establish correlations between variables over huge numbers of participants using data that has been gathered in a non-invasive fashion and in natural settings. Unlike large-data projects in the physical sciences, however, the use of these data sets in the social sciences requires that the subjects generating the data be treated in a fair and ethical fashion. This is often taken as requiring either compliance with the Common Rule, or that the data be de-identified to ensure the privacy of the subjects. But de-identification turns out to be far more difficult than one might think. In particular, the ability to re-identify subjects from a set of attributes that can be linked to other data sets has led to a number of mechanisms, such as k-anonymity or l-diversity, that attempt to define technical solutions to the de-identification problem. But these mechanisms are not without their cost. Recent work has shown that de-identification of a data set can introduce statistical bias into that data, making the results extracted by analysis of the de-identified set differ significantly from those same analyses applied to the original set. In this paper, we will look at how this bias is introduced when a particular form of de-identification, k-anonymity, is applied to a particular large data set generated by the Massive Open Online Courses (MOOCs) offered by Harvard and MIT. We will discuss some of the tensions that arise between privacy and Big Data science as a result of this bias, and look at some of the ways that have been proposed to avoid the trade-off between accurate science and privacy. Finally, we will outline a promising new approach to de-identification which appears to avoid much of the bias introduction, at least on the data set in question.
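The k-anonymity criterion discussed in the abstract requires that every combination of quasi-identifying attributes (attributes that could be linked to external data sets) be shared by at least k records, so that no individual can be singled out by those attributes alone. A minimal sketch of this check, with invented attribute names and records for illustration (not drawn from the paper's MOOC data set):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Illustrative records: age range and 3-digit ZIP prefix act as
# quasi-identifiers; grade is the analytic variable of interest.
records = [
    {"age_range": "20-29", "zip3": "021", "grade": 0.91},
    {"age_range": "20-29", "zip3": "021", "grade": 0.64},
    {"age_range": "30-39", "zip3": "021", "grade": 0.77},
]

# The (30-39, 021) group has only one record, so 2-anonymity fails.
print(is_k_anonymous(records, ["age_range", "zip3"], 2))  # False
print(is_k_anonymous(records, ["age_range", "zip3"], 1))  # True
```

Achieving k-anonymity in practice typically means generalizing or suppressing values until every group reaches size k, and it is exactly that generalization and suppression that can introduce the statistical bias the paper examines.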
