We often think of data as a neutral, objective view of the world, but that isn’t quite true. Human biases can make their way into the algorithms we build to analyze data and into the data and labels we use to train them. On the latest episode of the podcast “Women in Data,” Itoro Liney—Data Science Manager at Principality Building Society and AI Ethics Researcher—sits down with host Karen Jean-Francois to talk about the impact of algorithmic bias and what data scientists can do to mitigate its effects in their work.
The Impacts of Biased Data
ML algorithms can sometimes produce biased results. Some of the most famous examples in recent years are the following:
- In 2016, ProPublica published an investigation into COMPAS, an algorithm used in the criminal justice system to predict the likelihood of recidivism. The investigation found that the algorithm incorrectly flagged Black defendants as high risk far more often than white defendants.
- In 2018, Amazon abandoned its AI recruiting tool after finding that the algorithm, trained mostly on resumes submitted by men, produced recommendations that favored male applicants.
- In 2019, Apple was criticized for giving women lower credit limits than men on its new credit card, which relied on Goldman Sachs algorithms that didn’t consider factors like personal income.
According to Liney, algorithmic bias occurs when a model treats people differently based on “protected” categories such as gender, race, or nationality. Models often take the shortest route to a conclusion, which can mean faster results but also some strange connections. Those connections don’t need to be explicitly built into the model.
See also: 6Q4: Responsible AI Institute’s Seth Dobrin on Attesting that Your App Does No Harm
A simple example of how bias appears
Liney breaks it down for us. Imagine someone building a model to predict how expensive a house is. The model notices that houses with more rooms tend to be more expensive; in many cases, a five-bedroom house costs more than a one-bedroom home.
Now imagine that a very large house turns up with an open floor plan. The model categorizes this big house as very cheap because it only has one bedroom. It gets the price wrong because it has tied the prediction to something superficial.
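To make that concrete, here is a minimal sketch of the shortcut problem, assuming a scikit-learn setup and entirely invented prices; the only input the model ever sees is the bedroom count.

```python
# Minimal sketch: a price model that only sees the number of bedrooms
# (all figures invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

bedrooms = np.array([[1], [2], [3], [4], [5]])                     # feature: bedroom count
prices = np.array([150_000, 220_000, 310_000, 400_000, 480_000])   # target: sale price

model = LinearRegression().fit(bedrooms, prices)

# A 300 m² open-plan house that happens to have a single bedroom
open_plan_house = np.array([[1]])
predicted = model.predict(open_plan_house)[0]
print(f"Predicted price: £{predicted:,.0f}")
# The prediction lands at the cheap end of the range: the model has latched
# onto a shortcut (bedroom count) rather than what actually drives value.
```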
This might be a trivial example, but these are very real concerns. When the same thing happens with markers that point to a particular race or gender, for example, it can lead to very different treatment of different people based solely on those characteristics.
We need to understand what the algorithm is intended to do and recognize unintended outcomes when they appear. In many of the examples above, these algorithms latched onto a marker to make a decision, leading to direct harm for those involved.
Liney believes that there’s more to data science work than just the technical aspects. She’d like to see more work and conversation around applied ethics to help even beginning data scientists avoid unintended biased outcomes.
How algorithms can amplify bias
It’s more than just the data that’s in question. Algorithms themselves can replicate implicit biases. Back to the house example: sometimes there aren’t enough appropriate samples to fully train the model on what gives a property its value. It never learns what makes this particular “one bedroom” house different from the average one-bedroom. Even worse, the algorithm can effectively ignore outliers like this; it generalizes so much that it makes the bias worse.
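As a rough illustration of that amplification, here is a sketch, again with invented numbers, in which the unusual houses are heavily outnumbered; the fit settles near the majority pattern, and almost all of the remaining error lands on the rare group.

```python
# Sketch of bias amplification through under-representation (figures invented).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 95 typical one-bedroom flats around £150k and 5 open-plan one-bedroom houses around £600k
typical = 150_000 + rng.normal(0, 10_000, 95)
open_plan = 600_000 + rng.normal(0, 20_000, 5)

X = np.ones((100, 1))                     # the model still only sees "1 bedroom"
y = np.concatenate([typical, open_plan])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

mae_typical = np.mean(np.abs(pred[:95] - y[:95]))
mae_open_plan = np.mean(np.abs(pred[95:] - y[95:]))
print(f"Mean error on typical flats:    £{mae_typical:,.0f}")
print(f"Mean error on open-plan houses: £{mae_open_plan:,.0f}")
# The fit sits close to the majority pattern, so the under-represented houses
# absorb almost all of the error; generalizing over them makes the gap worse.
```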
It’s easy to spot bias in this type of example. In the real world, we’re talking about massive data and highly complex algorithms. So how do data scientists stay vigilant?
Practical advice for building better algorithms
Liney mentions two ways to tackle bias:
- Before you perform the analysis: What data do you have coming in? Is there a significant disparity between groups? Have you performed descriptive analytics before jumping into modeling?
- After the analysis: Can you break the algorithm’s results out by different protected characteristics (age, gender, race, etc.)? Are you seeing the same accuracy across these groups? (A rough sketch of this kind of check follows this list.)
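As a minimal sketch of that after-analysis check, assuming your pipeline already produces predictions alongside a protected characteristic (the column names and values below are placeholders), the idea is simply to compare accuracy group by group.

```python
# Sketch of a per-group accuracy check (column names and values are placeholders).
import pandas as pd

results = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "y_true": [1, 0, 1, 1, 1, 0, 1, 0],   # actual outcomes
    "y_pred": [0, 0, 1, 0, 1, 0, 1, 0],   # the model's decisions
})

# Accuracy broken out by a protected characteristic; a large gap is a prompt
# to dig back into the data and the model before shipping anything.
correct = results["y_true"] == results["y_pred"]
print(correct.groupby(results["gender"]).mean())
```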
According to Liney, it’s vital to understand the data you have. Where does it come from? Does it reflect the world in general, or is it focused on one particular location or population? In some cases, the data simply isn’t appropriate for the analysis you want to do. For example, you might design a predictive maintenance algorithm to manage the maintenance of airplanes in North America. Even though the data is about airplane maintenance, that model may not work for a region like Siberia, where the environment is vastly different.
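One quick way to catch that kind of mismatch, sketched below with hypothetical datasets and column names, is to compare the conditions in the training data against the environment where the model will actually run.

```python
# Sketch of a training-vs-deployment comparison (datasets and columns are hypothetical).
import pandas as pd

train = pd.DataFrame({"ambient_temp_c": [25, 18, 30, 22, 27],
                      "flight_hours":   [1200, 900, 1500, 1100, 1300]})
deploy = pd.DataFrame({"ambient_temp_c": [-35, -28, -40, -31, -25],
                       "flight_hours":   [800, 950, 700, 1000, 850]})

# Side-by-side summary statistics make a mismatch obvious before any modeling.
print("Training region:\n", train["ambient_temp_c"].describe(), sep="")
print("\nDeployment region:\n", deploy["ambient_temp_c"].describe(), sep="")
# If the deployment environment sits far outside the training range (here,
# temperatures the model has never seen), the data isn't appropriate for the job.
```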
3 ways to account for algorithmic bias right now
For Liney, aiming for explainability can help mitigate some of the worst effects of algorithmic bias. It may be fun to go for the most complex algorithm, but much real-world data science can be done with more explainable models. Complex models that are simply black boxes don’t make it easy to screen for bias.
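The sketch below, with invented feature names and data, shows why an interpretable model makes that screening easier: you can read off which inputs drive its decisions.

```python
# Sketch of the explainability argument: an interpretable model exposes what
# drives its decisions (feature names and data are invented for illustration).
import pandas as pd
from sklearn.linear_model import LogisticRegression

loans = pd.DataFrame({
    "income_k":      [30, 55, 42, 75, 28, 60, 35, 80],
    "postcode_band": [1, 4, 2, 5, 1, 4, 2, 5],   # could be acting as a proxy
    "approved":      [0, 1, 0, 1, 0, 1, 1, 1],
})

X = loans[["income_k", "postcode_band"]]
y = loans["approved"]

model = LogisticRegression().fit(X, y)

# With a linear model, you can see which features push decisions and by how
# much; this is exactly the screening that a black-box model makes hard to do.
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name:15s} weight: {coef:+.3f}")
```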
Next, she encourages leaders to hire ethically minded data scientists and analysts, and she hopes this will become a normal part of the hiring process. It’s not always about who can build the most advanced or complex model but about who deeply understands how data and models become biased and can act quickly to change course.
She also suggests actively hunting for proxies: ordinary-looking features that can accidentally or implicitly identify protected groups. Addresses are a classic example, since outcomes tied to lower-income areas can skew the results.
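One rough way to hunt for such proxies, sketched here with invented feature names and values, is to test whether those ordinary-looking features can predict the protected attribute better than chance.

```python
# Sketch of proxy hunting: if innocuous features predict a protected attribute,
# they can smuggle it into the model (feature names and values are invented).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.DataFrame({
    "postcode_median_income": [21, 62, 24, 70, 19, 65, 23, 68],  # £k, derived from address
    "years_at_address":       [3, 5, 2, 6, 4, 7, 3, 5],
    "protected_group":        [1, 0, 1, 0, 1, 0, 1, 0],
})

X = data[["postcode_median_income", "years_at_address"]]
y = data["protected_group"]

# If these features predict the protected group far better than chance,
# treat them as proxies: scrutinize, transform, or drop them.
scores = cross_val_score(LogisticRegression(), X, y, cv=4)
print(f"Proxy-check accuracy: {scores.mean():.2f} (chance here would be about 0.50)")
```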
Continuing the work of ethics to prevent algorithmic bias
This is an ongoing conversation. Liney is hopeful that more people will continue the work of AI ethics and algorithmic bias to find balance between things like privacy and data availability. These conversations are vital to ensuring that what we create in artificial intelligence will always be for our highest good.
To hear the conversation and more details about Liney’s work in AI ethics, listen to the podcast here.
Elizabeth Wallace is a Nashville-based freelance writer with a soft spot for data science and AI and a background in linguistics. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain – clearly – what it is they do.