Researchers expose biases in datasets used to train AI models

Nitin Naresh December 21, 2018

Artificial intelligence (AI) has a bias problem. Word embedding, a common technique that represents words as vectors, inevitably picks up, and at worst amplifies, prejudices implicit in the source text and dialogue it is trained on. A 2016 study, for instance, found that word embeddings trained on Google News articles tended to exhibit female and male gender stereotypes.
Fortunately, researchers are making headway in addressing the problem, or at least in exposing its severity. In a paper published on the preprint server arXiv.org (“What are the biases in my word embedding?”), scientists at Microsoft Research, Carnegie Mellon, and the University of Maryland describe an algorithm that can expose “offensive associations” related to sensitive issues like race and gender in publicly available embeddings, including supposedly “debiased” embeddings.
Their work builds on a University of California study that details a training solution capable of “preserv[ing] gender information” in word vectors while “compelling other dimensions to be free of gender influence.”
“We consider the problem of Unsupervised Bias Enumeration (UBE), discovering biases automatically from an unlabeled data representation,” the researchers wrote. “There are multiple reasons why one might want such an algorithm. First, social scientists can use it as a tool to study human bias … Second, identifying bias is a natural step in ‘debiasing’ representations. Finally, it can help in avoiding systems that perpetuate these biases: problematic biases can raise red flags for engineers, while little or no bias can be a useful green light indicating that a representation is usable.”
The team’s model takes as input word embeddings and lists of target tokens, such as workplace versus family-themed words, and uses vector similarity across pairs of tokens to measure the strength of associations. It is unsupervised, meaning that sensitive groups such as gender or race need not be prespecified, and it outputs “statistically significant” tests for racial, gender, religious, age, and other biases.
This confers a number of advantages over manual test design, the team says.
“It is not feasible to manually author all possible tests of interest. Domain experts normally create such tests, and it is unreasonable to expect them to cover all possible groups, especially if they do not know which groups are represented in their data … [And] if a word embedding reveals no biases, this is evidence for lack of bias.”
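The association measure described above, vector similarity between groups of tokens, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual statistic: it computes a differential cosine-similarity score for one word against two groups of name vectors and omits the significance testing the researchers describe. All vectors and names (`toy`, `group_a`, `group_b`, `association`) are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(word_vec, group_vecs):
    """Average cosine similarity between one word and a group of vectors."""
    return sum(cosine(word_vec, g) for g in group_vecs) / len(group_vecs)

def association(word_vec, group_a, group_b):
    """Positive when the word sits closer to group A than to group B."""
    return mean_similarity(word_vec, group_a) - mean_similarity(word_vec, group_b)

# Toy 2-D vectors made up for illustration (assumption); real embeddings
# typically have hundreds of dimensions.
toy = {
    "name_a1": [1.0, 0.1],
    "name_a2": [0.9, 0.2],
    "name_b1": [0.1, 1.0],
    "name_b2": [0.2, 0.9],
    "office": [0.8, 0.3],
    "home": [0.3, 0.8],
}
group_a = [toy["name_a1"], toy["name_a2"]]
group_b = [toy["name_b1"], toy["name_b2"]]

office_score = association(toy["office"], group_a, group_b)
home_score = association(toy["home"], group_a, group_b)
print(f"office: {office_score:+.3f}, home: {home_score:+.3f}")
```

In this toy setup, a positive score means the word leans toward the first name group and a negative score means it leans toward the second. A real test would also need to check whether the gap is statistically significant.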
The model leverages two properties of word embeddings to produce the aforementioned tests, according to the team: “parallel” and “cluster” properties. The parallel property takes advantage of the fact that differences between similar token pairs, such as Mary–John and Queen–King, are often nearly parallel; differences between topic words that run parallel to differences between names may represent biases. The cluster property, meanwhile, refers to the fact that normalized vectors of names and words cluster into semantically meaningful groups: for names, social structures such as gender and religion, and for words, topics such as food, education, occupations, and sports.
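Both properties can be illustrated with a small sketch. The parallel property is checked by comparing the direction of two difference vectors, and the cluster property by grouping normalized name vectors with a tiny spherical k-means. The clustering method and its evenly spaced initialization are assumptions chosen for brevity, not necessarily what the paper uses, and the 3-D vectors are invented for the example.

```python
import math

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def normalize(u):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(normalize(u), normalize(v))

# Toy 3-D vectors made up for illustration (assumption).
vecs = {
    "mary": [0.9, 0.1, 0.5],
    "anna": [0.8, 0.2, 0.4],
    "john": [0.1, 0.9, 0.5],
    "paul": [0.2, 0.8, 0.4],
    "queen": [0.8, 0.2, 0.9],
    "king": [0.1, 0.8, 0.9],
}

# Parallel property: the Mary-John and Queen-King differences point
# in nearly the same direction, so their cosine is close to 1.
name_diff = sub(vecs["mary"], vecs["john"])
word_diff = sub(vecs["queen"], vecs["king"])
parallelism = cosine(name_diff, word_diff)

def spherical_kmeans(points, k, iterations=10):
    """Cluster unit vectors by highest dot product with each centroid."""
    step = len(points) // k
    # Deterministic start: take k points spaced evenly through the list.
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: dot(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

# Cluster property: normalized name vectors fall into two groups.
names = ["mary", "anna", "john", "paul"]
labels = spherical_kmeans([normalize(vecs[n]) for n in names], k=2)
print(f"parallelism: {parallelism:.3f}")
print(dict(zip(names, labels)))
```

With these toy vectors, the two difference vectors are almost perfectly aligned, and the four names split into two clusters without any group label being supplied in advance.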
To test the system, the researchers sourced sets of first names from a Social Security Administration (SSA) database and words from three publicly available word embeddings, taking care to remove first names whose embeddings reflect other uses, such as a month, verb, or U.S. state. They also recruited workers from Amazon’s Mechanical Turk to determine whether biases uncovered by the algorithm were consistent with “(problematic) biases held by society at large.”
The team’s tool discovered that, in some of the word embedding datasets, words like hostess tended to be closer to volleyball than to cornerback, while cab driver was closer to cornerback than to volleyball. The human evaluators agreed — in one case, they found 38 percent of race, age, and gender associations to be offensive.
“Unlike humans, where implicit tests are necessary to elicit socially unacceptable biases in a straightforward fashion, word embeddings can be directly probed to output hundreds of biases of varying natures, including numerous offensive and socially unacceptable biases,” the team wrote. “The racist and sexist associations exposed in publicly available word embeddings raise questions about their widespread use.”
Source: VentureBeat