Facebook can infer many things, even about people who deliberately stay away
Hi, this is Steven Cherry for IEEE Spectrum’s “Techwise Conversations.”
We all have things we don’t want to put on Facebook, and for some, the
loss of privacy is so large that they stay off the social network
entirely. But it turns out that, to quote heavyweight champion Joe
Louis, “You can run, but you can’t hide.”
To quote a research paper published last month on PLoS One, “With
the help of machine learning, social network operators can make
predictions regarding the acquaintance or lack thereof between two
nonmembers with a high rate of success.”
It’s been known for a while that Facebook makes a shadow profile of
people it learns about who aren’t on Facebook. What the researchers here
found was that they could predict, with a surprising degree of
accuracy, whether two such nonmembers were acquainted with each other.
The paper, entitled “One Plus One Makes Three (for Social Networks),” was written by a team from the University of Heidelberg, and my guest today is the corresponding author, Katharina Anna Zweig.
Nina, as her colleagues call her, is a professor of theoretical
computer science, specializing in network analysis and graph theory. She
joins us by phone from her home in Heidelberg, where it’s late at night
already.
Nina, welcome to the podcast and guten Abend.
Katharina Anna Zweig: Thank you very much for this invitation to speak to you.
Steven Cherry: So, there’s maybe several parts to this. First,
social networks often make guesses about whether two people know each
other, and we’ve all seen this. LinkedIn, for example, has a “people you
may know,” Facebook has a whole column on the side called “Know them?”
Google+ has something similar. How do these networks make those
inferences?
Katharina Anna Zweig: So, I don’t know, actually. So
we—actually, this was the starting point of our project. So my
colleague, Professor Hamprecht, got such an e-mail, like “You are a
nonmember, but you might know these members of”—in this case—“Facebook,”
and he was startled. Because these people were not people that invited
him personally. It was—it was a good set of people, and he actually knew
almost all of them, and he’s in machine learning, so he—he was really
startled, and so the...how? How do they know? And then we found out that
whenever you register with Facebook, Facebook will ask you for your
full e-mail address book. And so it’s quite easy for them to know that
these members know him, even if they didn’t choose to invite him
personally.
Steven Cherry: Okay, so tell us about your research and how you tried to figure out what Facebook was doing.
Katharina Anna Zweig: So we teamed up, a machine learner
and a network analysis guy, and then we wanted to understand whether we
could infer something about the relationship between nonmembers if we
had information about the network between members and who they know
outside of any social network. And so what we did is that for each two
nonmembers, we looked at the people that are on the membership site and
how they are connected to each other. And you can, you can call this
“features.” So, for example, we looked at person A and person B, and
these are nonmembers. And we looked at how connected...their friends on
the social network platform would be connected. And this would give a
vector of 15 features—properties. And then with a quite standard
machine-learning algorithm, we tried to give a prediction value to these
vectors, and what happens then is that for each two people we get a
score. And we would then say that the highest 10 percent or 20 percent
we would predict as being connected, and then we can check in our data
whether they really are connected or not. And this is what we did. And
yeah, we saw that we can under some assumptions, we can predict about 40
percent of the connections between nonmembers correctly.
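The scoring step described above can be sketched in Python. This is a minimal illustration, not the study's code: the interview does not list the 15 features, so the three features here (number of common member contacts, their Jaccard overlap, and how densely those common contacts are connected to each other) are assumptions, as are the toy data, the equal weights in `score`, and the `top_fraction` cutoff. The study used a standard machine-learning algorithm to assign scores; the fixed weighted sum below only stands in for it.

```python
from itertools import combinations

# Toy data (hypothetical): friendships among members, and member -> nonmember
# contacts as they might appear in uploaded address books.
member_edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
contacts = {
    "a": {"x", "y"},
    "b": {"x", "y"},
    "c": {"x", "z"},
    "d": {"z"},
}

def member_friends(nonmember):
    """Members whose address book contains this nonmember."""
    return {m for m, known in contacts.items() if nonmember in known}

def connected(u, v):
    """True if two members are friends on the platform."""
    return (u, v) in member_edges or (v, u) in member_edges

def pair_features(p, q):
    """Three illustrative features for a pair of nonmembers (assumed, not the paper's)."""
    fp, fq = member_friends(p), member_friends(q)
    common = fp & fq
    union = fp | fq
    # Share of pairs of common contacts that are friends with each other.
    pairs = list(combinations(sorted(common), 2))
    density = sum(connected(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    jaccard = len(common) / len(union) if union else 0.0
    return [len(common), jaccard, density]

def score(features, weights=(1.0, 1.0, 1.0)):
    """Stand-in for a learned model: a fixed weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))

# Rank every pair of nonmembers by score and predict the top share as connected.
nonmembers = sorted(set().union(*contacts.values()))
ranked = sorted(
    combinations(nonmembers, 2),
    key=lambda pair: score(pair_features(*pair)),
    reverse=True,
)
top_fraction = 0.34  # hypothetical cutoff; the interview mentions 10 or 20 percent
k = max(1, round(top_fraction * len(ranked)))
predicted = ranked[:k]
```

In this toy network, `x` and `y` share two member contacts who are themselves friends, so that pair ranks first and is the one predicted as connected.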
Steven Cherry: We should point out that, I guess, you were
looking at both predicting that nonmembers were friends when they were and
were not when they were not. And that both went into the 40 percent.
Katharina Anna Zweig: Yes, it did. In a way, because it’s always
easy if you predict everyone to be connected, then you will always get a
100 percent score, but you will make a lot of wrong guesses.
Steven Cherry: Yeah, I guess we should talk about the machine
learning a little bit. I gather this involves taking a training set, so
you take a sample where you do know everything, and you figure out
basically an algorithm, and then you—I’m sorry—you figure out an
algorithm, and then you point it to a training set where you already
know all of the answers.
Katharina Anna Zweig: And of course we also know it for doing
the quantitative analysis. So we not only know the answers for the
training set, but we also know it for the nontraining sets, and this
information is used to assess the quality of the algorithms. So the
training phase is used to find out—so I told you earlier about these 15
properties in the vector. And not all of these properties are really
important. The machine-learning algorithm helps us to find out which of
the properties is really important for predicting whether two nonmembers
are connected or not. And the machine-learning algorithm understands,
based on some training sets where we tell him, which of the nonmember
pairs are really connected, are friends. Based on this, the
machine-learning algorithm learns ways. So, for example, he learns that
if two nonmembers have at least five friends together on the social
network platform side, and these five friends are also connected with a
probability of 0.5, then this is a very good indicator that the two
nonmembers are also connected or are also friends. And so this is the
first phase of the machine-learning algorithm, where we learn the ways
of the teachers. And then [in] the second phase, we will then make our
predictions on a set of samples that the algorithm has not seen before.
And these two sets are really independent, so there’s no choice of doing
anything spooky there. And in the second phase, we will then evaluate
how good our algorithm was. So hopefully in the second phase we do have
the real relationships, and we only try to predict as many relationships
as there are really in the data sets. And based on this known “ground
truth,” we call it, based on this, we can say how many of the ones that
are really there were predicted by our algorithm.
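The two phases can be sketched in Python as follows. This is a simplified stand-in under stated assumptions, not the study's algorithm: "training" here just picks the single feature whose average differs most between connected and unconnected training pairs, and all vectors and labels are made-up toy data. The evaluation follows the description above: predict as many links as truly exist in the held-out set, then count what share of the real links was recovered.

```python
def train_pick_feature(train_vectors, train_labels):
    """Stand-in for training: pick the feature index whose mean differs most
    between connected (1) and unconnected (0) training pairs."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    n_features = len(train_vectors[0])
    gaps = []
    for i in range(n_features):
        pos = [v[i] for v, y in zip(train_vectors, train_labels) if y == 1]
        neg = [v[i] for v, y in zip(train_vectors, train_labels) if y == 0]
        gaps.append(mean(pos) - mean(neg))
    return max(range(n_features), key=lambda i: gaps[i])

def recovered_fraction(test_vectors, test_labels, feature_index):
    """Predict as many links as truly exist, then report the share recovered."""
    k = sum(test_labels)  # number of real links in the ground truth
    if k == 0:
        return 0.0
    order = sorted(
        range(len(test_vectors)),
        key=lambda j: test_vectors[j][feature_index],
        reverse=True,
    )
    hits = sum(test_labels[j] for j in order[:k])
    return hits / k

# Hypothetical feature vectors: [common member friends, an uninformative feature]
train_vectors = [[5, 1], [4, 0], [0, 1], [1, 0]]
train_labels = [1, 1, 0, 0]

# Held-out pairs the "model" has not seen, with known ground truth.
test_vectors = [[6, 0], [0, 1], [3, 1], [4, 0], [1, 1]]
test_labels = [1, 0, 0, 1, 1]

best = train_pick_feature(train_vectors, train_labels)
fraction = recovered_fraction(test_vectors, test_labels, best)
```

Keeping the training and test pairs disjoint is what makes the recovered share a fair estimate; scoring the model on pairs it was trained on would overstate its quality.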
Steven Cherry: So what was your—what was the set of people that you worked on?
Katharina Anna Zweig: Okay. So, this is a very good point,
crucially, so we were able to get access to five real Facebook data sets
from five universities. So this is data from 2005, and at this point
almost 80 percent of all of the students in these five universities were
really members of Facebook. And we know the total network structure
there. What we did not know is who these people knew outside. Yeah, so
we just had the data between the people on Facebook. So, what can you do
if you don’t have the data that you need? You need to emulate it in a
way. So we took this data and had five very different evolutionary
models of how people decide to become a member of a new social network
platform. So how did you decide to become a member of any social network
platform? Maybe one of your friends was already there. Or you thought
it was just a very cool social network platform. And so our five models
scale between a totally dependent behavior, where your own decision
depends on how many of your friends already are members, and the totally
independent decision. And we have five different models, but we’re kind
of scaling between these extremes. And we used these models to
artificially divide this real data set into members and nonmembers. For
this artificial division, we knew exactly who is connected to whom, but
our algorithm would not know this. So for all five evolutionary models
of how somebody decides to become a member, we got almost the same
results qualitatively.
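One way to picture the artificial division is the Python sketch below. It is an assumption-laden illustration, not one of the study's five models: the single `dependence` parameter (0 for a fully independent decision, 1 for a decision driven only by the share of friends who already joined), the function name `split_members`, and the toy friendship graph are all hypothetical.

```python
import random

def split_members(adjacency, target_share, dependence, seed=0):
    """Grow a member set one person at a time until target_share have joined.

    dependence = 0.0: everyone is equally likely to join next.
    dependence = 1.0: the chance of joining follows the share of friends
    who are already members.
    """
    rng = random.Random(seed)
    people = sorted(adjacency)
    target = round(target_share * len(people))
    members = set()
    while len(members) < target:
        outsiders = [p for p in people if p not in members]
        weights = []
        for p in outsiders:
            friends = adjacency[p]
            share = sum(f in members for f in friends) / len(friends) if friends else 0.0
            # The tiny floor keeps the weights positive so someone can always join first.
            weights.append((1 - dependence) + dependence * share + 1e-9)
        joiner = rng.choices(outsiders, weights=weights, k=1)[0]
        members.add(joiner)
    return members, set(people) - members

# Hypothetical friendship graph standing in for a real data set.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

members, nonmembers = split_members(adjacency, target_share=0.6, dependence=0.5)

# Ground truth hidden from the predictor: links between two nonmembers.
hidden_links = {
    tuple(sorted((u, v)))
    for u in nonmembers
    for v in adjacency[u]
    if v in nonmembers
}
```

Because the split is applied to a network whose full structure is known, the links in `hidden_links` serve as ground truth for checking predictions, while a predictor only sees member-to-member and member-to-nonmember links.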
Steven Cherry: Now, it turns out that the prediction is more
accurate, I gather, the more, the higher percentage of people are on
Facebook. So I guess in the United States, for example, something like
70 percent of all adults in the U.S. are on Facebook, and that makes it
relatively easy, I gather, to make predictions about the remaining 30
percent.
Katharina Anna Zweig: Yes it does. So this is one parameter we
took into account. I really need to stress that we are not talking about
Facebook because we are just talking about all social network platforms
that have information about their members and the connection of members
to nonmembers. So of course you’re totally right that this parameter is
very important. If we were looking at a social network platform where
only 5 percent or 10 percent of any population are members, then we
couldn’t do much because basically machine learning always depends
on statistics. But what we saw is that, first of all, we don’t need to
look at the full population. Sometimes we might be only interested in
schoolkids or students of a given university. And in the subpopulation,
as you said, the percentage might be very high and might go to 70, 80
percent. For our machine-learning algorithm, we don’t need much more
than 50, 60 percent.
Steven Cherry: So I guess the lesson here is that if people stay
off of a social network, whether it is Facebook or any other, because
they basically don’t want that social network to have information about
them, in a way, staying off may not be sufficient, because the social
network may know about them through the people who are members and can
start to collect information and make inferences about them, even if
they’re not members.
Katharina Anna Zweig: You know, I’m a scientist, I always want
to say that “What can I really infer, and what can I not infer?” Ah,
yeah, there is an important point. The important point is that in our
study, we only used the contact data. Because we didn’t have any other
social information about the people. So if we were a social network
platform, we would have additional information on the age, on the
location, on the education, whatever. And if you take this additional
information into account, it’s very likely that you can do a better
inference than what we did now.
Steven Cherry: So these supplement the 15 features or replace some of them.
Katharina Anna Zweig: Totally.
Steven Cherry: And the accuracy could potentially be much, much higher than that 40 percent.
Katharina Anna Zweig: Yes. So this is what my experts, my
colleagues in machine learning, say. They expect that if you take this
additional information into account—we know that a connection between
people is always driven by homophily, so it’s much more likely that
two people of the same age will be connected than two people that are
very different in age, and so on. So taking this additional data into
account would certainly improve the quality. And also the social network
platform has a time, or dynamic, information, because they can see when
a former nonmember...whether their predictions were right or not.
So...and it’s very, very important to us that we don’t say that social
network platforms are currently doing this, and I also wouldn’t know
what type of harm can be done with it. And so our study is not about
this. We don’t know anything about this. It’s just about this fact that
staying away from social network platforms is not enough. And I was so
surprised, because when I thought about these results, I felt
guilty. So we have a very famous German magazine, and I often sent an
article to my brother, to my father, via this website, yeah? I just
click on the link “send article to friend.” And so I’m actually doing
the same because this platform can see what types of articles I like.
And when I give information or when I enter the e-mail address of my
brother, then this platform also gets information that I have a
relationship to this person and that I think this person is interested
in the article. So I think actually our study hints at something much
bigger. We are constantly leaving information about relationships to
others, and a relationship is by definition information between two
people. But if I give you the information that I know my brother, I know
this person, can he use this information? Is this okay or not? And I
think as a society we need to speak about this. What’s—what do we want
companies or others to do with information about a relationship that is
only revealed from one of the two sides?
Steven Cherry: And the information implied by these connections
can be pretty personal. They could involve political beliefs; they could
involve, perhaps, illnesses that a person might be suffering from; they
could involve a person’s sexual orientation—or anything.
Katharina Anna Zweig: Yes, definitely. So I saw on some wall on a
social network platform something like “Get well soon.” And this...this
was information that the other person revealed about the person
whose wall it was. So I think we cannot—I think we cannot really
prohibit our friends from posting this information, but we need to
talk about whether this information can be used or should be free in
any way.
Steven Cherry: Very good. Well Nina, as social networks become
more powerful, it’s very important to figure out just how powerful
they’re becoming, so thank you for your research, and thank you for
joining us today.
Katharina Anna Zweig: Yeah, thank you very much.
Steven Cherry: We’ve been speaking with network researcher
Katharina Anna Zweig about how when a social network gets big enough, it
can figure out a lot about the people who stay off the network. For
IEEE Spectrum’s “Techwise Conversations,” I’m Steven Cherry.
Announcer: “Techwise Conversations” is sponsored by National Instruments.
This interview was recorded 9 May 2012.
Audio engineer: Francesco Ferorelli
NOTE: Transcripts are created for the convenience of our readers
and listeners and may not perfectly match their associated interviews
and narratives. The authoritative record of IEEE Spectrum’s audio
programming is the audio version.