AI, Big Data, & the End of Privacy: An Interview w/ Michal Kosinski


AUGUST 1, 2019

This week, we are featuring an interview with Dr. Michal Kosinski, an Associate Professor of Organizational Behavior at Stanford’s Graduate School of Business. Dr. Kosinski is an expert on privacy, big data, and psychological targeting -- all topics of increasing importance to the public conversation about technology. His findings have been the subject of considerable controversy -- most notably the finding, which we referenced in a prior newsletter, that sexual orientation can be reliably predicted from pictures of faces. However, as with many discussions around AI, this controversy is fraught with misconceptions, both about the technology itself and the tradeoffs inherent to using it ethically. Wherever one comes down on these issues, the details matter. In the interest of preserving those details, we’ve decided to publish this interview in long form. We hope it will be rewarding to anyone interested in this important topic.

PTI: Let’s begin with the topic of AI bias, which we’ve written about before. Many people are concerned about AI bias, but there seems to be a great deal of confusion about where precisely the bias enters the system, and who is to blame for any particular instance of bias. Much of your work focuses on AI systems and what we can infer from them. What do you think are the biggest misconceptions people have about AI bias?


MK: AI bias is not the focus of my research -- these are more misconceptions that people have about AI in general. But the first misconception I see is that people completely lose sight of the fact that no decision-making or measurement process can be perfect. There’s always going to be bias. So the goal is not to eradicate bias completely, because that’s impossible, but to minimize it. This sounds obvious, but it has tremendous consequences.

To give you an example, some statistical and machine learning systems are being used in parole and sentencing decisions. And we can prove beyond any doubt that these algorithms will be, to some degree, biased. Usually, people conclude that we cannot introduce biased algorithms in practice, failing to consider the alternative: a human judge that is even more biased. 
Interestingly, people also argue that algorithms are biased because they are trained on biased data produced by biased humans. But what is hidden in this argument is the notion that human decision-makers or human recruiters or human clinical psychologists are inherently biased. So if we train algorithms on this data and then put some effort into removing the bias, it’s impossible not to do better than the original generative process that came up with the data. 


PTI: It seems there is another confusion still, which is that the term bias is applied relative to different goals. One can be biased relative to the goal of, for example, predicting most accurately who is going to recidivate. Or one can be biased such that, whatever the accuracy, the results end up targeting certain populations more than others. And it does seem that people tend to conflate these two senses of bias: the procedural bias and the outcome bias.


MK: That’s completely right. Across the world, males are much more likely to be incarcerated. But does that mean that the judges are sexist? Absolutely not. It’s just that males are, for many different reasons, more likely to commit crimes. So you can have a bias in the population, but it doesn’t necessarily mean it’s unfair. Quite the opposite: such bias may represent real-world phenomena.


PTI: Why, then, do people mistrust AI systems more, or worry more about their bias?


MK: We like to forget that human decision makers, such as judges, are biased -- typically much more than an algorithm that could replace them. It is also much more difficult to measure bias in humans. One cannot, for example, apply a human judge to millions of archival cases to measure their bias, which gives human decision makers a kind of deniability. Algorithms, by contrast, are an easy target because their bias can be measured relatively easily. Of course, this measurability is what enables reducing the bias, but it also gives critics ammunition to criticize the algorithms.


PTI: I know AI bias is not your direct area of research, but it does seem that some of the backlash to your work has resulted from similar misconceptions. For example, some critics of your faces/sexual orientation study seem torn between the view that the study is false (because faces do not predict sexual orientation) and the view that it is true but dangerous (because such technology targets gay people disproportionately). Do you share this sense?


MK: I totally feel this way, and I think that there’s an even deeper problem. AI is a relatively new and poorly understood phenomenon, so people don’t really understand how it works. Many concluded that “Kosinski created Gaydar,” but I did not. I took an existing facial recognition library -- of the kind that could be found on your phone -- and showed that it has the potential to discriminate between gay and straight faces, creating a serious privacy threat. Very similar facial recognition technologies are already being used in airports and elsewhere to catch criminals, and homosexuality is a crime in some countries, so it’s conceivable that it could be used in this way. 
Critics have also argued that such systems just shouldn’t be built. But as you’ve pointed out, there is still a danger even if a “Gaydar” algorithm isn’t built directly.
In the past, if you wanted a “Gaydar” algorithm, you needed to ask your engineers to create one. Today, you can take off-the-shelf facial recognition software, or hire one of the gazillion machine learning startups, and say, “Hey, I have 1,000 people of Type A and 1,000 people of Type B -- learn how to distinguish between them.” You don’t have to tell the software, or the engineers, what Types A and B stand for. The algorithm will just look at those faces and learn how to distinguish them, and only the person or organization that ordered the analysis will know what predictions are being made. And then such an algorithm can be applied to make predictions about anyone’s face.
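A minimal sketch of that workflow in Python might look like the following; the embed_face helper is a hypothetical stand-in for whatever off-the-shelf library produces the face features, and none of this is the code used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_face(image_path: str) -> np.ndarray:
    """Hypothetical stand-in for an off-the-shelf face-embedding library.

    Here it just returns a dummy feature vector so the sketch runs end to
    end; in practice this would be the vector produced by a facial
    recognition model.
    """
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    return rng.normal(size=128)

# The customer supplies two lists of photos. The labels are just "A" and "B";
# neither the software nor the engineers need to know what they stand for.
group_a_photos = ["a_0001.jpg", "a_0002.jpg", "a_0003.jpg"]   # "Type A"
group_b_photos = ["b_0001.jpg", "b_0002.jpg", "b_0003.jpg"]   # "Type B"

X = np.vstack([embed_face(p) for p in group_a_photos + group_b_photos])
y = np.array([0] * len(group_a_photos) + [1] * len(group_b_photos))

classifier = LogisticRegression(max_iter=1000).fit(X, y)

# Only whoever ordered the analysis knows what a high score means here.
score = classifier.predict_proba(embed_face("unknown_person.jpg").reshape(1, -1))[0, 1]
print(f"P(Type B) = {score:.2f}")
```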
And by the way, facial recognition is just one of many avenues that can be used to invade our privacy. A very similar mechanism is used by Facebook’s Lookalike Audiences -- have you heard of it?


PTI: I haven’t, no.


MK: Facebook, like other online advertising platforms, allows advertisers to upload lists of users identified by their phone numbers, emails, or other personal identifiers. An advertiser can then target those users with ads and messages. They can also ask Facebook to find other users who look like those on the uploaded list and target them as well.
Now, guess who looks like liberal, or gay, or atheist users? Other liberals, gays, or atheists! An advertiser can show messages to such users, and if they engage with the message or navigate to the advertiser’s store, the advertiser knows they are likely to belong to the targeted class. Meanwhile, neither the targeted users nor Facebook knows that they are being targeted because they are liberal, gay, or atheist.
In practice, imagine an organization -- for example, the Westboro Baptist Church -- that uploads to Facebook 10,000 email addresses of gay users obtained, let’s say, from an online forum. Next, the organization asks Facebook to show ads -- say, ads promoting an online petition -- to millions of people who look like those in the uploaded sample: other gay users. The WBC can now rest assured that most of the people who see and click on the ad are gay, and it can collect their emails, harass them online or offline, and so on.
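The strength of that inference is just Bayes’ rule. A toy calculation with purely hypothetical rates -- illustrative numbers, not measurements of any real campaign -- shows how strongly a single click can identify someone once the audience has been skewed by the lookalike step:

```python
# Hypothetical, illustrative numbers only.
p_target_in_lookalike = 0.40   # fraction of the lookalike audience in the targeted group
ctr_target = 0.05              # click-through rate among members of the targeted group
ctr_other = 0.005              # click-through rate among everyone else

# Bayes' rule: probability that a user who clicked belongs to the targeted group.
p_click = (p_target_in_lookalike * ctr_target
           + (1 - p_target_in_lookalike) * ctr_other)
p_target_given_click = p_target_in_lookalike * ctr_target / p_click

print(f"P(member of targeted group | clicked) = {p_target_given_click:.2f}")
# With these made-up rates, roughly 87% of the users who click belong to the group.
```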


PTI: Let’s address some of the other critiques of your faces/sexual orientation study. Two of the primary ones are that you assumed binary sexual orientation and that the neural network you used picked up non-morphological characteristics (e.g. facial hair, glasses, picture angle, etc.). What do you make of these claims?


MK: It’s spelled out explicitly in the paper that we don’t assume that sexual orientation is binary. People adopt a multitude of sexual identities, and those identities have fuzzy boundaries. The fact that we focused on distinguishing between two of them does not mean that the others don’t exist. In fact, had we had access to non-binary labels, the classifier would likely have been even more accurate.
With respect to non-facial features, again, we stated explicitly in the paper that those pictures were not standardized, so of course you will have a mixture of different factors. They interact with each other, and it’s extremely difficult to disentangle them. Our claim was that it is possible to distinguish between gay and straight people based on their facial images. When looking for factors that enabled the prediction, we noticed that gay people’s facial images were more gender atypical -- both in terms of facial morphology and other factors, such as grooming.
Given the non-standardized character of the images, it is difficult to say whether such gender atypicality is driven by actual morphological differences between gay and straight people, or whether it is mostly grooming and self-presentation. This is, however, somewhat irrelevant in the context of privacy risks, which are equally serious whether the differences are morphological or not. Those risks are also extremely difficult to address, because there seem to be many subtle differences rather than a few major ones. If, for example, gay and straight people started wearing beards in equal proportions, the algorithm could use any of the myriad other variables that distinguish between them.
Another misrepresentation of our findings is the claim that we argued our paper speaks to the origins of sexual orientation. What we write in the paper is that our findings are consistent with prenatal hormone theory (PHT), the most popular and widely accepted theory explaining the origins of sexual orientation. PHT suggests that sexual orientation is unlikely to be the only gender atypical characteristic of gay people, which is exactly what we observed in our results: gay faces tended to be gender atypical in terms of facial grooming, self-presentation, and even morphology.
Importantly, our paper was not designed to prove or disprove PHT; it is a well-established theory. The fact that our findings are consistent with this theory, however, lends additional credibility to our conclusions. I’m not aware of a genuine subject-matter expert in human sexuality who would say these results don’t make sense, but I would be glad to learn if there is one.


PTI: Stepping back a bit from this particular paper, it seems that some of the critiques you’ve faced and many of the concerns about AI more generally stem from a discrepancy between the scientific picture of human nature and the popular or common-sense view. When machines notice things about us that draw attention to this discrepancy, we get uncomfortable. Is that an impression you share?


MK: Oh, totally. The lack of free will, for example, is probably one of the most important findings in psychology, and yet both the scientific community and the general society have not really caught up to it. The claim that free will exists is not only extraordinary but would require some magical aspect to the functioning of the physical world and the human brain. Essentially, people who insist on the existence of free will believe in magic. 


PTI: I guess the connection to your research is that, if there’s no free will, there may be no limit to this project you’ve been engaged in -- noticing various correlations between behavior and psychological or biological characteristics. And to the extent that there’s a popular misunderstanding of how humans work, those findings will be controversial because they encroach on our self-image.


MK: That’s true, but I think there is a misunderstanding there: that the lack of free will implies that our behavior is fully predictable. This could not be further from the truth. Behavior can be both fully determined and unpredictable. When rolling a pair of dice, for instance, you don’t know what the outcome is going to be, but you can perfectly determine the distribution of possible outcomes, right?
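The dice case can be written down exactly; a few lines of Python are enough to enumerate the fully determined distribution that sits behind each unpredictable roll:

```python
from collections import Counter
from itertools import product

# Exact distribution of the sum of two fair six-sided dice: any single roll
# is unpredictable, but the distribution itself is fully determined.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
for total in sorted(counts):
    print(f"P(sum = {total:2d}) = {counts[total]}/36")
```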


PTI: It can be probabilistically predicted.


MK: Exactly. It’s like in a chess game. When you look at the endgame, skilled players don’t even have to continue playing, because they already know from the position of the board who is going to win. At the beginning of the game, however, while the outcome is similarly determined, it is much more difficult to predict, and hence it is interesting to keep playing. When you translate that into our real social lives, which are just so much more complicated than a chess game, it becomes obvious how difficult it is to predict the future. And this is one of the reasons behind the illusion of free will.
It’s likely, however, that from the perspective of AI, a game of human life will one day become as predictable as a game of chess. We now know that AI can easily map the potential outcomes of a chess game; it could, potentially, do the same for human lives one day.


PTI: So let’s transition now to some of the other consequences of living in a world of advanced AI and rampant data collection. You’ve argued that we need to accustom ourselves to living in a “post-privacy world.” What does the term “post-privacy” mean to you, and how can we make such a world safe?


MK: Well, those are two big questions. To address the first, I think it’s pretty clear that already today, a determined third party can learn more about us than we would or should be comfortable with. Potentially even more than we know about ourselves. If this is surprising, consider the following example: we ask doctors to use diagnostic tools to predict future outcomes with an accuracy that we could not achieve otherwise. The same is true of psychologists or career counselors, for example. We ask them for predictions about our future lives that we are not able to make ourselves, for lack of knowledge or a correct predictive model.
Now, the interesting point is that the same predictions can be made, with higher accuracy, by statistical or machine learning models. This is not new knowledge at all. Lew Goldberg, for example, showed in the 1960s that a simple equation can outperform psychiatrists in diagnosing various mental disorders. And now, of course, we’ve moved from those simple equations to very complex AI and machine learning models that can do it even more accurately, on a large scale, and using data that is widely available, like your browsing history, GPS location logs, Facebook Likes, and so on.
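To give a sense of how simple those equations were: the best-known example from that literature is a linear combination of a handful of MMPI scale scores compared against a fixed cutoff. The sketch below follows that form, but the specific scales and cutoff are quoted from memory and should be read as illustrative only, not as a diagnostic tool.

```python
# Illustrative only -- the exact scales and cutoff are from memory, and this
# is not a clinical instrument. The point is the *form* of the rule: a plain
# linear combination of a few scores, compared against a constant.
def goldberg_index(L: float, Pa: float, Sc: float, Hy: float, Pt: float) -> float:
    """Sum of three MMPI scale scores minus two others."""
    return (L + Pa + Sc) - (Hy + Pt)

def classify(L: float, Pa: float, Sc: float, Hy: float, Pt: float,
             cutoff: float = 45.0) -> str:
    return ("psychotic profile"
            if goldberg_index(L, Pa, Sc, Hy, Pt) >= cutoff
            else "neurotic profile")

# Example: index = (56 + 70 + 75) - (60 + 65) = 76, which is above the cutoff.
print(classify(L=56, Pa=70, Sc=75, Hy=60, Pt=65))
```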
So we are now living in a world where our intimate traits, preferences, and future behaviors or potential can be discovered without our knowledge, very quickly, and at low cost. And there’s no simple way to escape that, as most of us do not have the luxury of leaving behind our smartphones or email. Thus, I believe that our privacy is already largely gone, and we are only going to have less of it.
But now, what to do about this? Naturally, we should make efforts to slow down the erosion of our privacy, but the war is lost. One can regulate Google and Facebook, but it’s impossible to regulate foreign governments or startups with nothing to lose. Moreover, the same algorithms and services that invade our privacy the most are also the most bloody useful.
We want health apps to help us track our menstrual cycles and activity levels, help us get healthy, and so on. We want those apps to work and make our lives better. But now the question is, do we want the owners of those apps to have a monopoly on this knowledge? Shouldn’t we want this knowledge to be available to academics, and maybe -- with our consent -- to companies that can help us benefit in various ways? 23andMe, for example, has access to a lot of genetic information. And perhaps the Gates Foundation would be interested in using this data to cure malaria in Africa, where 23andMe has few customers.
So rather than demanding that online platforms hoard our data and not share it with anyone, we should destroy data monopolies and make sure that no single company has access to an enormous trove of data that no one else can access. Otherwise, power will concentrate and this data will not be used to our benefit as much as it could be. And then, of course, we should regulate what can be done with the data and educate people about good and bad uses of it. As citizens, we should have access to the data we generate on Google, Facebook, and other platforms, and be allowed to take it elsewhere.


PTI: Thanks very much for your time. These are important issues and we appreciate your willingness to discuss them.

Nathanael Fast