Welcome to this module on algorithmic fairness. We often say, "let the data speak for themselves." What could be less biased, what could be fairer, than simply following the data and letting them take us wherever they may lead? Well, it turns out that algorithms can be biased, and these biases can arise in many ways: because the training data set isn't representative of the population, because the past population isn't representative of the future population, or because confounding processes produce correlations that are flukes.

Let's begin by considering a simple example. Here's a company that has only 10% women employees, and it has a boys' club culture that makes it difficult for women to succeed. A hiring algorithm is trained on current data, and based on current employee success, it scores women candidates lower. The algorithm is fairly representing what happens today; it's fairly representing that women have difficulty succeeding in this company because of the boys' club culture. But the net consequence is that the algorithm scores women applicants lower, and so the company ends up hiring even fewer women. This is an algorithmic vicious cycle. It arose because of a bias in the algorithm, and the bias arose because of the data the algorithm was trained on. The net effect is nicely captured in this cartoon: if you run a cyber cafe for Macs, you're never going to get any data about PCs, so you won't know what kind of services would make a PC user happy. This kind of thing is now widely recognized, and it has even appeared on TV, such as in an episode of the CBS series The Good Wife.

So what can give us bad results from good data? I want to talk about three different things: first, the notion of correlated attributes; second, results that are misleading even when they're correct; and third, a more technical issue called p-hacking.

Let's begin with correlated attributes. Racial discrimination has been a major historical problem in the United States, and there are many laws concerning it. On the one hand, you have laws that prohibit consideration of race for some purposes. For example, at least some universities in some states are prohibited by law from considering race in admissions. Now, universities may value diversity, may want to consider race, may have affirmative action interests in mind. The way they can deal with this law is to find other features that get them the diversity they seek without violating the law. Famously, the University of Texas changed how it handled race in admissions: it took race out entirely and simply said it would admit the top ten percent of the class at any high school in Texas. Given the strong degree of racial segregation in Texas high schools, they knew there would be many high schools whose student bodies consisted essentially of minorities, and therefore minorities in the top 10%. This gave them the diverse student body they sought without in any way breaking the law. On the other hand, the same idea of substituting attributes can serve discriminatory intent. In the past, some lenders did not want to lend to borrowers who were black, and laws were passed to prevent such discrimination. So they found a technique of using surrogates.
Specifically, they redlined neighborhoods in the city that were primarily minority, and used that as a stand-in to get very close to the racial discrimination they wished to practice. So now the law prohibits lenders from explicit redlining as well. That stops one specific surrogate attribute, but the real point is that, in general, proxy attributes can be found. If you were a lender who wished to discriminate on account of race, it probably would not be hard to find other attributes that are highly correlated with race and that you can legally use, and you could use them to discriminate if that were your intent. So really, the point is that big data provides the technology to facilitate such proxy discrimination. The saving grace is that big data also provides the technology to detect and address such discrimination. It is up to us as responsible data scientists to make sure we use the power of this technology to do the right thing.

One of the things we want to do is stop unintentional discrimination. There is a well-known case of Staples unintentionally discriminating based on users' ZIP codes, and this kind of thing could have been stopped if the right discrimination analyses had been done ahead of time. So, in short, intent matters when one is considering discrimination, and one has to be clear about what kind of discrimination one wants to avoid.

To talk about this, let's frame discrimination as one target group versus everyone else. What we want to consider as discrimination is when an individual from the target group gets treated differently from an otherwise identical individual not from the target group. If a black person and a white person with identical qualifications apply, and one gets called back for a job interview while the other doesn't, we can all agree that is discrimination, but that is discrimination at the individual level. There is also discrimination in terms of aggregate outcome, which is a broader definition. Here you ask: what is the percentage success of the target group measured with respect to a control, namely the general population? You can compare this among the candidates, as a ratio over the full population universe, or as a ratio over just the qualified candidates. The exact results you get in each case are a little different, but they all show to what extent we are managing societal outcomes with respect to discrimination.

Another thing to keep in mind is that unintentional discrimination is hard to avoid. It happens, and we need to use our data-analytic techniques to avoid it better. In 2015, experiments at Carnegie Mellon University showed that significantly fewer women than men were shown online ads for high-paying jobs. This happened presumably because the ads shown were selected based on the recent history of click-throughs, and presumably men were clicking on ads for high-paying jobs more often than women were. So this was, again, not intentional discrimination on the part of some company in its choice of ads, but rather algorithmic discrimination that took place because the algorithm was perpetuating the status quo.
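To make the aggregate-outcome comparison above concrete, here is a minimal sketch in Python (not part of the lecture) that computes the target group's success rate against everyone else, both over the full candidate pool and over qualified candidates only. The column names "group", "qualified", and "hired" are hypothetical, chosen just for illustration.

```python
import pandas as pd

def selection_rates(df: pd.DataFrame, target: str) -> dict:
    """Compare how often the target group succeeds relative to everyone else."""
    is_target = df["group"] == target

    def rate(mask):
        subset = df[mask]
        return subset["hired"].mean() if len(subset) else float("nan")

    # Success rates over the full population of candidates.
    target_all, others_all = rate(is_target), rate(~is_target)

    # Success rates restricted to qualified candidates only.
    qualified = df["qualified"]
    target_q, others_q = rate(is_target & qualified), rate(~is_target & qualified)

    return {
        "rates_all": (target_all, others_all),
        "ratio_all": target_all / others_all,
        "rates_qualified": (target_q, others_q),
        "ratio_qualified": target_q / others_q,
    }

# Hypothetical example: a ratio well below 1.0 flags a potential
# aggregate-outcome (disparate impact) problem worth investigating.
applicants = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B", "B", "B"],
    "qualified": [True, True, False, True, True, True, False, True],
    "hired":     [False, True, False, True, True, False, False, True],
})
print(selection_rates(applicants, target="A"))
```

As the lecture notes, the exact numbers differ depending on whether you normalize by the full population or by qualified candidates only, but each ratio is a useful signal of how the algorithm's aggregate outcomes treat the target group.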