Reassembling Our Digital Selves
Deborah Estrin & Ari Juels
Abstract: Digital applications and tools that capture and analyze consumer behaviors are proliferating at a
bewildering rate. Analysis of data from large numbers of consumers is transforming advertising, generating
new revenue streams for mobile apps, and leading to new discoveries in health care. In this paper, we consider
a complementary perspective: the utility of these implicitly generated data streams to the consumer.
Our premise is that people can unlock immense personal value by reassembling their digital traces, or small data, into a coherent and actionable view of well-being, social connections, and productivity. The utility of reassembling the self arises in diverse contexts, from wellness to content-recommendation systems. Without design attention to the unique characteristics of small data, however, the image that these data provide to individual users will be, at best, like a cubist portrait: a fragmented picture of the self.
Management of small data presents fundamental design questions regarding the "who, what, and where" of access rights and responsibilities. The blend of competing and cooperating entities handling small data breaks down distinctions such as that between shared and private, and renders questions like whose data are they? hard to answer. Conceptual boundaries blur further as data increase in sensitivity and become "activated," such as when personal apps process and fuse longitudinal data streams to drive context-rich personalization algorithms on the consumer's behalf.
We explore this confusing landscape by drawing attention to three critical design objectives: programmatic access to the digital traces that make up small data, activation of small data for personal applications, and creation of privacy and accountability measures for the apps and services consuming small data. We point out the limitations of existing perspectives on both data ownership and control, and on privacy mechanisms, such as sanitization and encryption. Rather than attempting to provide answers, we pose key questions that should inform new system designs.
© 2016 by the American Academy of Arts & Sciences
doi:10.1162/DAED_a_00364
The term big data expresses the potential of extracting meaning and value using data sets covering large numbers of people, or a large n. Big data's humbler counterpart, small data, promises to be equally transformative, allowing individual users to harness the data they generate through their own use of online and mobile services; or, in other terms, when n = 1.1
The explosion of data about individuals is no secret. Personal data sources include: continuous activity and location data sourced from mobile devices and wearables; URL and click data from online searches; text and voice data from social and personal communications (emails and posts, texts and tweets); photographs both taken and viewed; entertainment preferences and consumption; daily purchases made online and offline; personal records from digital education files and medical systems; transportation preferences and patterns; and emerging sources such as genomic data, implanted medical devices, and wearables.
In this paper, we discuss the promise and problems associated with small data. We use the term small data to refer to the digital traces produced by an individual in the course of her daily activities, which she can use not to understand general trends across a population, but to understand herself.
All of the benefits of small data require a reassembly of the self, a partial to comprehensive drawing together of diverse small data sources pertaining to the individual. While many technical challenges arise related to data standards, storage, and computation, we focus on the issues in greatest need of architectural attention: access, activation, privacy, and accountability.
When should a person have programmatic access to the digital traces he generates, along with the capability to activate these data through applications and services of his choosing? Several examples highlight how small data, as a complement to big data, promises powerful new insights and opportunities:
1) Custom small-data analytics. Consider a mobile health application that guides a patient through preparation and recovery from hip surgery. Someday, such an app could analyze her daily walking patterns, provide predictive analytics on her recovery time, and engage her in the physical therapy regimen that best matches her unique medical situation. Such approaches are expanding beyond their origins in the quantified-self movement into broad-based health management practices.2 The perspective of small data, rather than big data, will provide not only global insights that lead to new therapies, but personalization of these therapies to the patient, time, and place.
2) Rich user-modeling to facilitate social services. The quantified self has a natural counterpart in the quantified student. For example, a teacher or tutor could gain great insight from a synthesis of individual students' detailed analytics, as captured by patterns in their online consumption of lectures and readings, or in their online input during homework exercises and examinations. Similar functionality could enrich relationships between mentors and mentees, coaches and clients, and provide crucial support to those whose job it is to safeguard the well-being of teens in the foster system.
3) Service and product personalization. Rich user-modeling is equally relevant to service personalization, recommendation systems, and advertising. Popular online platforms like Amazon and Netflix, and sharing-economy services like Uber and Airbnb, are largely informed by a very narrow set of data available to them, either directly or through third-party acquisition. Imagine the immersive recommendation systems that could be built by drawing on users' full suites of data, from online retail and service transactions to location and mobile communication data.
This direction could continue to be pursued strictly as a big data play to sell more products and services to targeted customers, such that utility is measured in terms of sales figures. However, we have already seen signs of customer pushback against the perceived "creepiness" of platforms mining personal data to boost sales. If individuals can, instead, demonstrably benefit from personalization on their behalf (in other words, if utility is instead shown in terms of small data benefiting the individual), then "getting it right" can advance whole industries beyond contention with consumers.
4) Enriching the arc of individual-to-community knowledge. Individuals share data with communities to accumulate shared knowledge and a collection of experiences. Small data streams contributed by individual users could, for instance, amplify the great success of manual data entry for sites such as PatientsLikeMe and Inspire, which help patients and caregivers understand and navigate the choices and challenges of specific medical conditions.3 The small data perspective also points to a path for this collective knowledge to return to the individual in the form of moment-to-moment guidance. Knowledge and predictions about matters from food allergies to triggers of seizures can be mapped continuously onto an individual's small data from a bank of collective experience.
These potential benefits are uncontroversial. But controversy arises, and design focus is most needed, when we consider an individual's access to her own small data. In order to realize the benefits inherent in the above examples, a consumer needs to have access to her own digital traces, and also needs to be able to activate them, such as by unlocking them in one piece of software and making them available for use in another. It might seem self-evident that this combination of access and activation of one's own data is an imperative, even a universal right. But it is not.
Do you have an irrevocable right to your own physiological data? It is hard to imagine an answer other than yes. But most fitness tracking devices and mobile apps do not give users direct access to raw data on their physiological measures, such as number of steps taken, skin temperature, body weight, speed of food intake, and heart activity. Instead, users must upload the data to a device- or software-maker's service for analysis and display. Users often cannot download or export this raw data because the makers of fitness devices and apps frequently rely on business models that exploit control of their users' data and outline terms of use that claim broad rights to user-generated data.4
Tensions around the rights of the individual to her physiological data are not new. But in the past these concerns primarily affected the small segment of the population with implantable medical devices, such as pacemakers and insulin pumps.5 Now, these issues of physiological data use and ownership impact every user of a mobile phone, smart watch, or fitness device.
One complication is the fact that the physiological data recorded by apps are created not just by an individual, but in collaboration with an app. The mobile app that records your footsteps is, in fact, a collaboration between your body, which produces the motion, and the accelerometers in your mobile device, which detect it. Their outputs are then translated by the app (or a cloud service) into human-consumable data, like pedometer readings. Moreover, the model that translates the data most likely benefits from data from other users, further complicating the issue!
Joint creation is a widespread feature of small data. If a user interacts with a service provider's content, such as when buying a video online or posting a comment on the provider's site, ownership and control of the resulting data, transaction, or text can be a complicated matter. Activities like taking group photos, videoconferencing, and gathering nutrition data on shared meals all result in the joint creation of small data.
As the amount and variety of small data balloon, and the stakeholders in them multiply, the questions of rights and control become increasingly complicated.
If people are going to have ready access to sensitive information about themselves, what platforms and methods will support the privacy and accountability needed for apps and services fueled by these data? Small data can furnish powerful insights, for good and ill. So it is essential that the technical community start to develop mechanisms and build products that allow users to access and activate their small data, while protecting them from abuses in digital and commercial ecosystems far too complex for them to reason through, let alone manage.
Today's exploration of privacy violations foreshadows tomorrow's challenges in small data protection. Consider this example: On July 8 at 11:20 a.m., Olivia hailed a taxi on Varick Street in the West Village in Manhattan. An eleven-minute ride brought her to the Bowery Hotel. She paid $6.50 for the ride. She did not tip. In the future, small data elements gathered in a narrative like this will be generated by a constellation of devices carried by the user, and by her supporting services. A data-rich payment ecosystem based on NFC (near field communication)-enabled devices will create a record of the payment and harvest ride details automatically. These data will then feed into personal applications such as automated diaries, personal expense reports, and time-management aids.
Though these hypothetical personal applications do not yet exist in the market, this type of small data is generated every minute and has already contributed to documented failures of data protection and control. The taxi ride cited above is real: the actress Olivia Munn traveled from Varick Street to the Bowery Hotel in 2013. In 2014, an enterprising researcher chose to mine public data, and published findings of public interest. The researcher had first noticed that publicly posted photos of celebrities entering and exiting New York City taxis often show legible taxi medallion numbers.6 Though the government of New York City does make data on individual taxi rides publicly available, it takes care to conceal medallion numbers to protect riders and drivers. Unfortunately, in one large data set, the city implemented this protection ineffectively, through misuse of a hash function, making it possible to associate ride information with specific medallion numbers.
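The failure is worth making concrete. Reports on the incident indicate that the city pseudonymized medallion numbers with an unsalted MD5 hash; because medallion numbers follow a few short, well-known patterns, anyone can hash every possible number in advance and invert the pseudonyms by table lookup. The Python sketch below is illustrative rather than a reconstruction of the actual release, and it assumes a simplified medallion format of digit, letter, digit, digit:

```python
import hashlib
import string
from itertools import product

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def build_lookup() -> dict:
    """Precompute the hash of every plausible medallion number.

    With a digit-letter-digit-digit format there are only
    10 * 26 * 10 * 10 = 26,000 candidates, so the whole 'protected'
    space can be enumerated in well under a second.
    """
    return {
        md5_hex(f"{d1}{ltr}{d2}{d3}"): f"{d1}{ltr}{d2}{d3}"
        for d1, ltr, d2, d3 in product(
            string.digits, string.ascii_uppercase, string.digits, string.digits
        )
    }

table = build_lookup()
pseudonym = md5_hex("5Y14")   # how a medallion appeared in the released data
print(table[pseudonym])       # -> 5Y14: the 'anonymized' rides are re-linked
```

A keyed hash or a table of random pseudonyms would have resisted this enumeration; any public, deterministic hash of so small an identifier space cannot.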
Ridesharing services like Uber are yet another way that these types of data are being generated. Thanks to its use of user-generated location data for pickups, Uber has transformed the use of small data in urban transportation. Like many other shared-service providers, the company is blurring the boundary between customer data, which is used to generate sales, and small data, or personal information, which is used to benefit the user. One could imagine Uber consuming additional user-generated data, such as its users' personal calendars, in order to provide more convenient (and powerful) services. A dark facet of Uber's convenience is the "God view," a (once secret) viewing mode available to Uber employees to track any user. Uber has purportedly used the God view to harass journalists who have written critically about the company.7 In 2012, Uber infamously published a blog post that tracked what the company called "rides of glory": rides whose timing seemed to indicate passengers had engaged in one-night stands.8 Given that Uber is generating at least a portion of this personal data, the question arises: should individual users have the ability to delete personal data stored with such services, or should they learn about how their data are used in order to hold service providers like Uber accountable for abuses?
While these particular privacy violations may not be of great concern to the general public, they illustrate the principle that personal data do not always originate directly with the user. More and more, personal data can be sourced from many different places and can emerge unpredictably. When personal information is turned into small data and made available for individual benefit, it comes burdened with complex provenance; thus, consumers will struggle to control small data they perceive as "theirs." Moreover, as small-data use transforms life-altering, positive realms, such as health care and education, the hazards and conflicting interests involved in data creation could bring additional serious issues to the fore, including data entanglement and data integrity.9
There are important limitations to existing designs and models for privacy and control. Several existing approaches to data protection, such as sanitization, cryptography, and ownership assignment, do not address the perspective of small data used by and for the individual. Sanitization is, very broadly speaking, the practice of redacting, aggregating, or adding noise to a collection of data to prepare it for safe release in a privacy-sensitive context. This is the approach that the New York City government took to prevent its taxi-ride data from being used to identify customers; it replaced medallion numbers with cryptographically constructed pseudonyms. As that example shows, data sanitization can be a fragile process. One mistake or unanticipated correlation can lead to unwanted data disclosures.
Another problem with sanitization is the trade-off between privacy and utility. Generally, with an increase in utility comes a decrease in privacy. This tension was strikingly demonstrated by a data set from sixteen MOOCs (massive open online courses) run by MITx and HarvardX on the edX platform.10 To comply with a federal statute known as the Family Educational Rights and Privacy Act (FERPA), scientists "deidentified" the data set, using a privacy measure called "k-anonymity." Subsequently, these data sets were widely studied by researchers. However, the scientists who produced the data set also discovered that the sanitized data differed in marked ways from the original data set. For instance, in the deidentified data set, the percentage of certified students, or those who successfully completed courses, dropped by nearly one half from the true data set. In this case, protecting privacy could have the drawback of invalidating studies meant to improve instruction quality for students.
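To see how such distortion arises, consider a toy version of k-anonymity by suppression: drop any record whose quasi-identifiers (here, country and activity level) are shared by fewer than k records. Because certified learners tend to be more active, and therefore more distinctive, suppression removes them disproportionately and skews the very statistic under study. The records below are invented for illustration:

```python
from collections import Counter

# Toy MOOC records: (country, activity, certified). Certified learners
# are rare and atypically active, so their quasi-identifiers stand out.
records = (
    [("US", "low", False)] * 40
    + [("US", "low", True)] * 2
    + [("IN", "low", False)] * 30
    + [("IN", "high", True)]      # a unique quasi-identifier combination
    + [("BR", "high", True)]      # another unique combination
)

def k_anonymize(rows, k=5):
    """Suppress rows whose (country, activity) group has fewer than k members."""
    counts = Counter((country, activity) for country, activity, _ in rows)
    return [row for row in rows if counts[(row[0], row[1])] >= k]

def certification_rate(rows):
    return sum(1 for *_, certified in rows if certified) / len(rows)

print(f"true certification rate:    {certification_rate(records):.3f}")               # 0.054
print(f"rate after 5-anonymization: {certification_rate(k_anonymize(records)):.3f}")  # 0.028
```

Here, as in the edX release, the certified share drops by roughly half, not because of added noise, but because anonymity and atypicality are fundamentally at odds.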
In the case of small data, the privacy-utility trade-off is particularly problematic, though not unique to it. There are many big-data analyses, such as medical studies, that can be done more or less safely using sanitized data.11 Sanitization, however, often does not scale down to the protection of small data: it is not possible to hide an individual's data within a crowd's when the utility of the data stems from its successful integration with other data pertaining to that individual. This problem is illustrated by a study of personalized medicine in which researchers examined estimates of stable dosages for warfarin, an anticoagulant medication, that were made using patients' genetic markers.12 Researchers demonstrated that in the standard model for such recommendations, a patient's estimated stable dose of warfarin leaks information about his genetic markers. Sanitizing the dose data (in other words, preventing leakage of genetic information by using standard privacy-protecting tools within the model) does not work.13 The model consumes a tiny amount of information (only two genetic markers), and the information is only sourced from one individual. Further, the cost of strong sanitization could be fatal. Degrading the fidelity of the model could result in inaccurately estimated stable warfarin dosages, which could very likely cause patient deaths.
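A simplified sketch conveys the mechanics of the leak. Suppose the dosing model is a known formula over age, weight, and two genetic markers; the coefficients below are invented, and the published attack is statistical rather than an exact lookup, but the enumeration logic is the same. Because the markers take only a handful of values, anyone who learns a patient's demographics and stable dose can solve for the genotype:

```python
# Illustrative (non-clinical) dose adjustments, in mg/week, for the six
# combinations of two pharmacogenetic markers.
MARKER_EFFECT = {
    ("VKORC1-GG", "CYP2C9-*1/*1"):   0.0,
    ("VKORC1-AG", "CYP2C9-*1/*1"):  -7.0,
    ("VKORC1-AA", "CYP2C9-*1/*1"): -14.0,
    ("VKORC1-GG", "CYP2C9-*1/*3"):  -5.0,
    ("VKORC1-AG", "CYP2C9-*1/*3"): -12.0,
    ("VKORC1-AA", "CYP2C9-*1/*3"): -19.0,
}

def predicted_dose(age: int, weight_kg: float, markers) -> float:
    """A hypothetical linear dosing model; the coefficients are made up."""
    return 35.0 - 0.2 * age + 0.1 * weight_kg + MARKER_EFFECT[markers]

def invert_genotype(age: int, weight_kg: float, observed_dose: float,
                    tolerance: float = 0.5):
    """Enumerate the few possible genotypes; keep those that match the dose."""
    return [m for m in MARKER_EFFECT
            if abs(predicted_dose(age, weight_kg, m) - observed_dose) <= tolerance]

# Demographics are often easy to learn; the stable dose might appear in a
# 'sanitized' study table. Together they pin down the genetic markers.
print(invert_genotype(age=60, weight_kg=80, observed_dose=19.0))
# -> [('VKORC1-AG', 'CYP2C9-*1/*3')]
```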
There is little motivation for sanitization when data are consumed by the individual who produced them, as is sometimes the case for small data. But given how many opportunities now exist for sharing small data, it would be natural to appeal to sanitization as a privacy-preserving tool.
Another technical approach to enforcing data confidentiality is the use of cryptography, particularly encryption. Take the example of medical data, also known as protected health information (PHI), which is a particularly sensitive form of small data. The federal Health Insurance Portability and Accountability Act (HIPAA) promotes encryption of such data. Organizations that properly encrypt data and store keys can, in the case of a breach, claim safe harbor status and bypass breach notifications. When properly deployed today, encryption is very robust: a standard algorithm, such as the Advanced Encryption Standard (AES), cannot be broken even by a powerful adversary. At first glance, properly implemented encryption seems like a cure-all for confidentiality issues.
But encryption, like sanitization, acts at odds with utility. Encrypted data cannot be computed on. (Theoretical and application-specific approaches to computing on encrypted data exist, but have limited utility in practice.) A system must have access to data in order to process it, and thus, if presented with encrypted data, must be able to decrypt it. Further, if a system has access to the encrypted data, then so, too, does an attacker that breaches the system or steals credentials, such as passwords, from a person with access. While encryption is an alluring technical approach to protecting privacy, it is not a magical, cure-all solution.
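A small example makes the point. The sketch below uses the third-party Python cryptography package (an assumption of this sketch; any standard AES-GCM implementation behaves the same way) to encrypt a stream of step counts. The ciphertexts are useless to whoever steals only the database; but to compute even an average, the service must hold the key and decrypt, and an attacker who steals the key or valid credentials inherits that same power:

```python
import os
import statistics
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)

# Encrypt three daily step counts, one ciphertext per reading.
readings = [4200, 8913, 10044]
stored = []
for steps in readings:
    nonce = os.urandom(12)   # a fresh nonce per message is mandatory for GCM
    stored.append((nonce, aead.encrypt(nonce, str(steps).encode(), None)))

# The stored blobs resist even a powerful adversary, but they are opaque
# to analytics as well: averaging the readings requires decrypting first.
decrypted = [int(aead.decrypt(nonce, ct, None)) for nonce, ct in stored]
print(statistics.mean(decrypted))   # 7719.0
```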
Given the limitations of technical measures in the protection of privacy, a call has arisen to appeal to economic protections, and perhaps even stimulate open markets for personal data. This approach, which we here refer to as ownership assignment, is exemplified by computer scientist Alex Pentland's "Reality Mining of Mobile Communications: Toward a New Deal on Data," which urges that users should "own their own data."14 Old English Common Law encapsulates this idea in three general rights for tangible property: users should control the possession, use, and destruction, or dispersion, of their data. The "New Deal on Data" goes a step further: users should also be able to treat data handlers like banks, withdrawing their data if desired and, as with Swiss banks, storing it anonymously. This deal, which is grounded in a commonsense physical model, is enticing. But data are distinctly different from land or money. Data management is far more complicated, and it defies physical transactional models. An acre or a dollar cannot be arbitrarily replicated by anyone who sees it. Nor can it be mathematically transformed into a new object.
Data, on the other hand, are infinitely malleable. They can arise in unexpected places and be combined and transmogrified in an unimaginable number of ways.
To understand the complexities of data ownership, we might ask: who owns the data you generate when you purchase electronic toys from Amazon or food from FreshDirect? Who owns the information produced by your viewings of movies on Netflix, or videos on YouTube? Who owns the data generated by your Android phone, purchased from Cyanogen, and connected to the T-Mobile network, to say nothing of the "physiological" data generated by third-party software on your Fitbit or Apple Watch?
In a previous issue of Dædalus on "Protecting the Internet as a Public Commons," legal scholar Helen Nissenbaum articulated relevant alternatives to property rights through her suggestion that we understand privacy as contextual integrity (the idea that privacy is a function of social norms and the environment in which disclosure occurs). She argues that instead of focusing on ownership assignment, we focus on the right to access.15 Our essay is an argument for that right, and further, for the embodiment of that right in the data and services markets and architectures that we are investing in as leaders of organizations, designers of products, executors of regulations, and consumers of services.
But even with this formulation of property rights, complications arise. Many small data settings invoke involuntary hazard, in which the handling of small data by one person can affect the privacy or rights of another without his or her knowledge or involvement. This can occur either from joint data creation or from interactions between individuals on a given platform. Emails, blog posts, and group photos all implicate people captured or referenced in these media with or without their consent, just as a Facebook "gift" creates a record of the sender and the (potentially unwitting) recipient. Many more forms of involuntary hazard will arise as cameras and sensors proliferate, as small data are increasingly aggregated in the cloud, and as analysis and correlation of small data streams yield new insights. Innocent bystanders in photographs, for example, could also be implicated by data sharing.
Kinship gives rise to a particularly striking example of involuntary hazard in the handling of small data. Parenthood, the ultimate act of joint creation, creates shared genetic material among kin. These genetic data provide a long-term window into a person's health prospects and behavioral characteristics. Given the sensitivity of such data, the U.S. Federal Genetic Information Nondiscrimination Act (GINA) of 2008 prohibits the use of genetic information by health insurers and employers. As a result of direct-to-consumer genetic testing, however, some people choose to post their genetic data online in repositories, such as openSNP, to help catalyze medical discoveries.16 This choice impacts their kin and descendants, potentially for many decades. As shown in a study by security and privacy researcher Mathias Humbert and colleagues, genetic data enable strong inferences about the predisposition to Alzheimer's disease of people related to its carriers.17
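The underlying arithmetic is elementary, which is part of the problem. In the single-locus toy model below (Humbert and colleagues use far richer models that exploit correlations across the genome), a parent who publishes a genotype at one disease-associated position immediately sharpens what anyone can infer about a child who has published nothing; the allele frequency here is an invented figure:

```python
RISK_ALLELE_FREQ = 0.15   # assumed population frequency of risk allele 'a'

def child_risk_allele_distribution(parent_genotype: str,
                                   q: float = RISK_ALLELE_FREQ) -> dict:
    """Probability that a child carries 0, 1, or 2 risk alleles, given one
    parent's published genotype; the other parent is drawn from the population."""
    p_transmit = {"AA": 0.0, "Aa": 0.5, "aa": 1.0}[parent_genotype]
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for from_parent, p1 in ((1, p_transmit), (0, 1.0 - p_transmit)):
        for from_other, p2 in ((1, q), (0, 1.0 - q)):
            dist[from_parent + from_other] += p1 * p2
    return dist

print(child_risk_allele_distribution("AA"))  # child carries 'a' only by chance
print(child_risk_allele_distribution("aa"))  # child is certain to carry 'a'
```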
Of all the problems with privacy and accountability mechanisms described here, the most fundamental challenge is, perhaps, psychological in nature. As with health risks associated with exposure to toxins in the air and water, individuals' welfare in terms of privacy is typically degraded more by cumulative exposure than by acute events. Galvanizing people to address gradual threats is a significant challenge, without a simple solution.
Given the challenges we have enumerated, key design decisions made today will determine whether we can foster an equitable future society that, while flooded with small data, respects both the value of individuality and personal rights and preferences. Such a society could empower individuals to improve their well-being, social connections, and productivity through intentional use of their small data, while largely avoiding the harmful side-effects of data sharing, such as loss of privacy and vulnerability to predatory businesses. Amid an explosion in the generation, collection, and analysis of small data, however, as well as a resulting erosion of existing models of rights and control, how can we articulate and navigate the decisions needed to realize this vision?
We believe that it is critical both to take a step back from existing models of small-data use, confidentiality, and control, and to frame and reflect on three foundational questions.
What are the practically realizable roles and rights of the individual in the management of small data? Granting ownership and control to individuals over their small data alone will not enable meaningful stewardship. History has shown that many individuals do not have the time or interest to administer fine-grained policies for data access and use. (Facebook privacy settings continue to baffle the majority of users.)18 It is increasingly impractical for people even to be aware of what data they have produced and where it is stored. As small data become ubiquitous, confidentiality will become increasingly difficult to protect, and leaks may be inevitable. What practical remedies are there?
We suspect that any workable remedy will foremost recognize that individuals' rights should not end with disclosure, and should instead extend to data use. Thus, policies such as HIPAA, which emphasize confidentiality as a means of restricting data flow, will need to be supplemented by rights protections that encompass disclosed data and create fair use and accountability. Consider, again, the example of GINA: if people publish their genetic data, their kin should remain protected.
What fundamental bounds and possibilities exist in data privacy and accountability for small data? There will always be trade-offs between utility and confidentiality. As described above, encryption and sanitization can achieve confidentiality, but often at the expense of the data's usefulness. While emerging cryptographic technologies (such as secure multiparty computation and fully homomorphic encryption) have basic limitations and probably will not alter the landscape for many years to come, they delineate possibilities.19 They show, for example, that it is possible to mathematically simulate a "trusted third party" that discloses only preagreed-upon results of computation over data without ever revealing the underlying data. Trusted hardware such as the pending Intel SGX technology holds similar potential, but with much more practical, medium-term promise.20
Access in such a trusted third-party model can be time-bounded (granted for past data, present data, and/or future data) and limited according to any desired criterion or algorithm. Such a permissions-based model is applicable to both streaming and static data, and is especially useful for "activated" linked data that are long-lived, streaming, and distributed, and used in various ways to drive models and algorithms. This model can create a high degree of accountability by constraining and recording the flow of data.
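As a concrete illustration of this permissions-based model, the sketch below encodes a grant as a grantee, a data stream, a time window, and an arbitrary predicate, and it records every access decision in an audit log. All names here are hypothetical; this is a design sketch, not an existing system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List, Optional

@dataclass
class Grant:
    grantee: str
    stream: str                  # e.g. "steps", "location"
    not_before: datetime         # start of the permitted time window
    not_after: datetime          # end of the permitted time window
    predicate: Callable[[dict], bool] = lambda record: True

@dataclass
class DataGuard:
    grants: List[Grant]
    audit_log: List[str] = field(default_factory=list)

    def read(self, grantee: str, stream: str, record: dict) -> Optional[dict]:
        ts = record["timestamp"]
        allowed = any(
            g.grantee == grantee and g.stream == stream
            and g.not_before <= ts <= g.not_after and g.predicate(record)
            for g in self.grants
        )
        # Every decision, allow or deny, leaves a trace for accountability.
        self.audit_log.append(f"{grantee}/{stream}: {'ALLOW' if allowed else 'DENY'}")
        return record if allowed else None

now = datetime.now()
# A physical therapist may read step counts from the past 90 days only.
guard = DataGuard(grants=[Grant("pt_clinic", "steps",
                                now - timedelta(days=90), now)])
print(guard.read("pt_clinic", "steps",
                 {"timestamp": now - timedelta(days=5), "steps": 6200}))  # allowed
print(guard.read("pt_clinic", "location", {"timestamp": now}))           # None (denied)
print(guard.audit_log)
```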
A simulated trusted third party can offer richer options than releasing sanitized data to researchers. For example, the MOOC data in the HarvardX and MITx study could be made available to researchers not as a sanitized dataset, but as an interface (an API, or application program interface) to a system that manages the raw data.
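In miniature, such an interface might look like the sketch below: the raw records never leave the managing system, and the API answers only pre-agreed aggregate queries, refusing any computed over a cohort small enough to single out a student. The field names and minimum-cohort rule are our own illustrative assumptions:

```python
from statistics import mean

# Raw learner records held inside the managing system; never exported.
RAW_ROWS = [
    {"country": "US", "hours": 12.0, "certified": True},
    {"country": "US", "hours": 1.5,  "certified": False},
    {"country": "IN", "hours": 9.0,  "certified": True},
    {"country": "IN", "hours": 0.5,  "certified": False},
    {"country": "BR", "hours": 22.0, "certified": True},
]

MIN_COHORT = 2   # refuse aggregates over groups that could single someone out

def certification_rate(country: str) -> float:
    """A pre-agreed aggregate query: certification rate by country."""
    cohort = [row for row in RAW_ROWS if row["country"] == country]
    if len(cohort) < MIN_COHORT:
        raise PermissionError("cohort too small to report")
    return mean(1.0 if row["certified"] else 0.0 for row in cohort)

print(certification_rate("US"))        # 0.5, computed on the true raw data
try:
    print(certification_rate("BR"))
except PermissionError as err:
    print("refused:", err)
```

Researchers get exact answers where the cohort permits, rather than a degraded copy of the whole data set; the trade-off moves from data fidelity to query expressiveness.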
Nonetheless, public or semipublic release of data will always exist in society; thus, we ought to understand privacy as contextual integrity, a function of social norms and the environment in which disclosure occurs.21 This concept points toward a future in which semantics automatically govern the flow of data. An intelligent system could discover and determine when and how to process and release data on behalf of consumers, and when and how to fabricate plausible white lies on their behalf (a key social norm). This would be a boon for data-enriched social engagement that would also help restore control to users.
What market demands, government regulations, industry self-regulation (through standardized terms of service), and social norms will shape the rights of consumers to have access to their small data? To answer this question, we might look at the striking tensions in the commercial handling of health and fitness data today. Recognizing the growing importance of health-related data, Apple has offered HealthKit, an app that serves as a hub for such data drawn from mobile apps. At the same time, the company is treating personal data like a hot potato: Apple does not store or otherwise access its users' health data, leaving this task and liability with app developers. Meanwhile, app developers are ravenously collecting "fitness" data that are likely to function as health data in the future. For example, researchers have shown strong correlations between general health and physical movement throughout the day, and many such apps track physical movement. None of these data are being managed under the aegis of HIPAA.22
Users often acquiesce to service providers, such as social networks, health and fitness app developers, and online retailers that take possession of their small data and hold it captive, not sharing it with the users themselves or facilitating its export. Online superpowers like Facebook maintain control over user data and interactions to the extent of being able to influence voter turnout in national U.S. elections.23 Will our digital selves be reassembled on our behalf by monolithic service providers? Or will an ensemble of entities instead act in concert, according to individual users' tastes and objectives? For more than a decade, mobile phones were controlled by the mobile service provider; since the emergence of smartphones and app stores, control has shifted to the consumer. Which future should we design for? Should ownership of personal data be an inalienable right, rather than one that can be blithely signed away through a terms-of-service agreement?
As Vint Cerf, coinventor of the Internet architecture and basic protocols, has remarked, privacy as we conceive of it today is a historical anomaly, possibly born of the urban revolution.24 In an age of selfies and social networks, there is every reason to believe that the individual's notions of boundaries and privacy, and of what constitutes personal and public, will continue to shift. Explicit models of the social norms driving policy and practice, and their relationships with market forces and government regulation, must be central to the project of designing the architecture of next-generation small-data systems.
Building methods, tools, and systems with small data in mind is an explicit design choice. By recognizing the role of the individual as beneficiary of her own data-driven applications and services, we are choosing to consider design criteria that are different from those faced by service providers. If we build systems and market practices that routinely provide people with direct programmatic access to their small data, along with the ability to export and use it, applications and services can offer users the benefit of highly individualized modeling and recommendations that would neither be possible nor acceptable otherwise. And yet, in building such systems, how do we also provide consumers with the safeguards to manage small-data exposure and its consequences in the long term, while still maximizing individual benefit?
We have presented what we believe to be the core challenges raised by small data. We hope that posing these questions is a first step in the direction of secure and beneficial use of small data, by individuals, governments, and enterprises alike.
Deborah
Estrin &
Ari Juels
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
d
a
e
d
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
/
1
4
5
1
4
3
1
8
3
0
6
9
5
d
a
e
d
_
a
_
0
0
3
6
4
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
145 (1) Winter 2016
51
Reassem-
bling Our
Digital
Selves
Endnotes
* Contributor Biographies: DEBORAH ESTRIN, a Fellow of the American Academy since 2007, is Professor of Computer Science at Cornell Tech and Professor of Healthcare Policy and Research at Weill Cornell Medical College. She is Founder of the Jacobs Institute Health Tech Hub and Cofounder of the nonprofit Open mHealth. Her recent publications include articles in Journal of Medical Internet Research, Journal of Acquired Immune Deficiency Syndromes, and ACM Transactions on Intelligent Systems and Technology.
ARI JUELS is Professor at the Jacobs-Technion Cornell Institute at Cornell Tech. He has recently published articles in Journal of Cryptology, Communications of the ACM, and IEEE Security & Privacy Magazine.
1 Deborah Estrin, "Small Data, Where n = Me," Communications of the ACM 57 (4) (2014): 32–34.
2 See the collaboration between users and manufacturers of self-tracking tools at http://quantifiedself.com/.
3 See https://www.patientslikeme.com/; and https://corp.inspire.com/patients-caregivers/.
4 For an example of such terms of use, see MyFitnessPal, "Terms of Use," http://www.myfitnesspal.com/account/terms_and_privacy?with_layout=true (accessed January 23, 2015).
5 "Fighting for the Right to Open His Heart Data: Hugo Campos at TEDxCambridge 2011," TEDx Talks, uploaded January 19, 2012, https://www.youtube.com/watch?v=oro19-l5M8k.
6 J. K. Trotter, "Public NYC Taxicab Database Lets You See How Celebrities Tip," Gawker, October 23, 2014, http://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546.
7 In fact, Emil Michael, Uber's senior vice president of business, had suggested to a private audience that the company might dedicate a portion of its financial resources to private researchers to investigate adversarial journalists' personal lives in retaliation for their negative coverage. See Gail Sullivan, "Uber Exec Proposed Publishing Journalists' Personal Secrets to Fight Bad Press," The Washington Post, November 18, 2014, http://www.washingtonpost.com/news/morning-mix/wp/2014/11/18/uber-exec-proposed-publishing-journalists-personal-secrets-to-fight-bad-press/; and Chanelle Bessette, "Does Uber Even Deserve Our Trust?" Forbes, November 25, 2014, http://www.forbes.com/sites/chanellebessette/2014/11/25/does-uber-even-deserve-our-trust/.
8 Bessette, "Does Uber Even Deserve Our Trust?"
9 For example, Munn and other celebrities whose rides surfaced in this data-mining exercise were criticized for not tipping their taxi drivers. Some alleged, though, that the taxi drivers themselves intentionally failed to record tips. In other words, Ms. Munn's small data may have been corrupted by a "privacy-conscious" (a euphemism for "tax-evading") taxi driver.
10 Jon P. Daries, Justin Reich, Jim Waldo, Elise M. Young, Jonathan Whittinghill, Daniel Thomas Seaton, Andrew Dean Ho, and Isaac Chuang, "Privacy, Anonymity, and Big Data in the Social Sciences," ACM Queue 12 (7) (2014), http://queue.acm.org/detail.cfm?id=2661641.
11 Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys (CSUR) 42 (4) (2010), doi:10.1145/1749603.1749605.
12 Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart, "Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing," Proceedings of the 23rd USENIX Security Symposium (Berkeley: USENIX, 2014), https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew.
13 Cynthia Dwork, "Differential Privacy: A Survey of Results," in Theory and Applications of Models of Computation, ed. Manindra Agrawal, Dingzhu Du, Zhenhua Duan, and Angsheng Li (Berlin: Springer Berlin Heidelberg, 2008), 1–19.
14 Alex Pentland, "Reality Mining of Mobile Communications: Toward a New Deal on Data," in The Global Information Technology Report 2008–2009: Mobility in a Networked World, ed. Soumitra Dutta and Irene Mia (Geneva: World Economic Forum, 2009).
15 Helen Nissenbaum, "A Contextual Approach to Privacy Online," Dædalus 140 (4) (Fall 2011): 32–48.
16 openSNP, https://opensnp.org.
17 Mathias Humbert, Erman Ayday, Jean-Pierre Hubaux, and Amalio Telenti, "Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy," Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (2013): 1141–1152, doi:10.1145/2508859.2516707.
18 Yabing Liu, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove, "Analyzing Facebook Privacy Settings: User Expectations vs. Reality," Proceedings of the ACM SIGCOMM Conference on Internet Measurement (2011): 61–70, doi:10.1145/2068816.2068823.
19 Craig Gentry, A Fully Homomorphic Encryption Scheme, Ph.D. dissertation for Stanford University Department of Computer Science (September 2009), https://crypto.stanford.edu/craig/craig-thesis.pdf; and Marten van Dijk and Ari Juels, "On the Impossibility of Cryptography Alone for Privacy-Preserving Cloud Computing," HotSec 2010 Proceedings of the 5th USENIX Conference on Hot Topics in Security (Berkeley: USENIX, 2010), 1–8.
20 Ittai Anati, Shay Gueron, Simon P. Johnson, and Vincent R. Scarlata, "Innovative Technology for CPU Based Attestation and Sealing," Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy (June 2013).
21 Nissenbaum, "A Contextual Approach to Privacy Online."
22 Robert Ross and K. Ashlee McGuire, "Incidental Physical Activity is Positively Associated with Cardiorespiratory Fitness," Medicine and Science in Sports and Exercise 43 (11) (2011): 2189–2194.
23 Zoe Corbyn, "Facebook Experiment Boosts U.S. Voter Turnout," Nature News, September 12, 2012, http://www.nature.com/news/facebook-experiment-boosts-us-voter-turnout-1.11401.
24 Gregory Ferenstein, "Google's Cerf Says 'Privacy May Be An Anomaly.' Historically, He's Right," TechCrunch, November 20, 2013, http://techcrunch.com/2013/11/20/googles-cerf-says-privacy-may-be-an-anomaly-historically-hes-right/.