To fix discrimination, we need to see it

By Narayana Kocherlakota | Bloomberg Opinion

Academic research has a big role to play in addressing the problem of racial and gender discrimination that the Black Lives Matter and Me Too movements have made newly urgent: To root it out, policy-makers need good evidence on where and how it’s doing damage.

Unfortunately, researchers’ supposedly neutral methods are biased against finding bias. This needs to change. Suppose we’re studying bias in employment. Managers are regularly presented with pairs of equally qualified Black and White candidates. If the process isn’t biased against Black people, the probability of choosing the White candidate should be the same as a coin toss, 50 percent. If it is biased, the probability of choosing the White candidate would be higher (say, 60 percent), and there would need to be a policy intervention to fix the process.

Typically, we can’t know what the actual probabilities are. We have to estimate them by studying a sample of the hiring decisions. In doing so, we need to be aware of two kinds of potential mistakes. We might find bias where there is none, and hence erroneously recommend action. Or we might miss bias that actually exists, and thus fail to act.

Say, for example, we’re looking at a sample of 20 decisions, in which 14 White candidates have been hired. On the surface, this looks like evidence of bias. But it might just be a random deviation: Maybe 14 Black candidates will be selected out of the next 20. In drawing a conclusion, we must weigh the risk of finding bias where it isn’t (what statisticians call a type one error) against the risk of failing to find bias when it actually exists (a type two error).

These risks are calculable. Suppose we conclude there is bias whenever at least 14 of the 20 hires are White. Then there’s a 6 percent probability of a type one error and, if the true probability of choosing a White candidate is 60 percent, a 75 percent probability of a type two error. If we set the threshold at 13, which is more likely to lead to an intervention, the probabilities shift to 13 percent and 58 percent, respectively. So which threshold should we choose? There’s no neutral answer. If we’re more conservative, and hence more concerned about recommending unnecessary actions, then we’ll go with the higher bar. If we prioritize social justice, then we’ll set the bar lower, in part because we believe that the cost of failing to act when bias exists is much greater than the cost of acting when it doesn’t.
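For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation. It assumes each of the 20 decisions is an independent draw, that "no bias" means a 50 percent chance of hiring the White candidate, and that "bias" means the 60 percent chance used in the example above. The function names are illustrative, not taken from any study.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) when X counts successes in n independent trials with chance p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def error_rates(threshold, n=20, p_fair=0.5, p_biased=0.6):
    """Error probabilities when bias is declared at `threshold` or more White hires.

    p_fair and p_biased are the article's illustrative 50 and 60 percent figures.
    """
    # Type one: the process is fair, but the count still reaches the threshold.
    type_one = upper_tail(n, threshold, p_fair)
    # Type two: the process is biased, but the count falls short of the threshold.
    type_two = 1 - upper_tail(n, threshold, p_biased)
    return type_one, type_two

for threshold in (14, 13):
    t1, t2 = error_rates(threshold)
    print(f"threshold {threshold}: type one {t1:.0%}, type two {t2:.0%}")
```

Under those assumptions the output matches the rounded figures in the text. A different assumed degree of bias (say, 70 percent) would change the type two numbers but not the type one numbers.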

One might expect analyses of racial and gender discrimination to use a range of thresholds, depending on researchers’ and policy-makers’ priorities. But they don’t. Investigators are taught to keep the probability of a type one error below a certain level, usually 5 percent or 1 percent, with no reference to the probability or consequences of making a type two error. They—and pretty much any journal considering publishing their results—would see the “social justice” threshold as too aggressive, merely because it entails a 13 percent chance of a type one error.
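To see what the conventional rule implies in the 20-decision example, the sketch below finds the lowest threshold that keeps the type one error probability under a given cap. It rests on the same assumptions as the example itself (independent decisions, 50 percent under no bias, 60 percent under bias), and the helper names are hypothetical.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) when X counts successes in n independent trials with chance p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def smallest_threshold(alpha, n=20, p_fair=0.5):
    """Lowest count of White hires whose type one error probability is below alpha."""
    for k in range(n + 2):  # k = n + 1 has probability zero, so the loop always returns
        if upper_tail(n, k, p_fair) < alpha:
            return k

for alpha in (0.05, 0.01):
    k = smallest_threshold(alpha)
    # Chance of missing bias if the true probability is the assumed 60 percent.
    type_two = 1 - upper_tail(20, k, 0.6)
    print(f"cap {alpha:.0%}: threshold {k} of 20, type two {type_two:.0%}")
```

With a 5 percent cap the threshold comes out at 15 of 20, and with a 1 percent cap at 16. In both cases the chance of missing a 60 percent bias is higher than the 75 percent associated with a threshold of 14.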

Such an approach might seem scientifically consistent, in the sense that different investigators would reach the same conclusions from the same data. But it makes no sense. Given the long history of racial and gender discrimination, there’s ample reason for a researcher to believe that type two errors—failure to recommend action—are more costly. Or the investigator might well have evidence of bias from outside the sample. In either case, the social justice criterion would be more appropriate.

Don’t get me wrong: Researchers have convincingly demonstrated the existence of racial and gender discrimination in many contexts. My point is that we should be cautious about treating studies that don’t find evidence of bias as a justification for doing nothing.

Social scientists need to recognize that they aren’t drawing conclusions about bias (or really any question of importance) in a vacuum. The past matters, in the sense that there is prior evidence from outside the sample. And the future matters, in the sense that the analysis will influence decisions that have social costs and benefits. Taking all this into account might lead different researchers, and policy-makers, to read the same data differently. But, overall, they’ll also be more likely to do more good.