Can better data make AI smarter?

IBM has developed and released a new data set with a million images for training the artificial intelligence that drives facial recognition systems.

There is a rush everywhere to use artificial intelligence to augment human decision making, and facial recognition is one of the hottest areas for AI.

But there have been issues, as Claude Yusti, who leads IBM Federal’s artificial intelligence practice, described to me.

Facial recognition systems built on AI have been widely deployed, but they have been troubled by mistakes and misidentifications, and it has been a challenge to get them to meet standards of fairness.

“The challenge isn’t the algorithms but that the set of information used to train the AI is flawed,” Yusti told me.

The data sets just haven’t been as comprehensive and diverse as they need to be.

To address that challenge, IBM and its Watson research team have created a new data set called Diversity in Faces. The DiF has a million facial images and each face is annotated with 10 coding schemes that include aspects such as craniofacial features and predictors for age and gender.

The images come from the Yahoo Flickr Creative Commons 100 Million (YFCC100M) data set and are licensed for public use.
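Annotations like these are what let researchers check whether a training set is actually balanced before they train on it. The sketch below is a minimal illustration of that kind of audit, not IBM’s tooling: the file name and the facet column names (age_group, gender_prediction) are hypothetical stand-ins for whatever schema a given annotation release defines.

```python
import csv
from collections import Counter

# Hypothetical annotation file and column names -- the real DiF release
# defines its own schema; this only illustrates the kind of balance
# audit a researcher might run on any annotated face data set.
ANNOTATIONS_CSV = "dif_annotations.csv"

def facet_distribution(path: str, column: str) -> Counter:
    """Count how many images fall into each value of one annotation facet."""
    counts: Counter = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            counts[row[column]] += 1
    return counts

if __name__ == "__main__":
    for facet in ("age_group", "gender_prediction"):  # hypothetical facets
        counts = facet_distribution(ANNOTATIONS_CSV, facet)
        total = sum(counts.values())
        print(f"\n{facet}:")
        for value, n in counts.most_common():
            print(f"  {value:<20} {n:>8}  ({n / total:.1%})")
```

A heavily skewed distribution on any facet is an early warning that a model trained on the data may perform unevenly across groups.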

The DiF data set will be available to researchers who are sharpening the AI in facial recognition systems.

The initiative is not a direct source of revenue for IBM, but the company undertook it because it was needed, Yusti said.

“For AI to be adopted, you have to trust in the AI and this kind of initiative helps build that,” he said. “If we engage in this properly, the opportunity for everyone, including IBM, increases.”

Yusti said that the DiF data set and the work IBM is doing with AI are really about building trust.

Users need to understand how AI reaches the answers that it does. When it comes to government in particular, citizens need to feel they are being treated fairly.

One issue with facial recognition is that the data sets used to train the AI were not deep and diverse enough. For example, a data set may not have included enough Hispanic faces to represent the wide variety of Hispanic faces.

This problem has led to systems misidentifying people or failing to recognize them at all, Yusti said.
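One way such skew surfaces is when a system’s error rate is broken out by demographic group rather than reported as a single overall number. The sketch below is a toy illustration of that breakdown, assuming hypothetical per-image records of the form (group label, true identity, predicted identity); it is not based on any IBM evaluation.

```python
from collections import defaultdict

# Toy records: (demographic_group, true_identity, predicted_identity).
# The groups, names, and fields are illustrative only.
results = [
    ("group_a", "alice", "alice"),
    ("group_a", "bob", "bob"),
    ("group_b", "carol", "dave"),   # misidentification
    ("group_b", "erin", "erin"),
]

errors = defaultdict(lambda: [0, 0])  # group -> [mistakes, total]
for group, truth, predicted in results:
    errors[group][1] += 1
    if predicted != truth:
        errors[group][0] += 1

for group, (mistakes, total) in sorted(errors.items()):
    print(f"{group}: {mistakes}/{total} misidentified ({mistakes / total:.0%})")
```

If one group’s misidentification rate is consistently higher, the usual suspect is exactly what Yusti describes: that group was underrepresented in the training data.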

The work on the DiF data set is IBM’s effort to build trust by helping AI systems become more accurate, he said.

“It’s the right thing to do to serve our clients,” he said. “And it helps people embrace and adopt this capability because it is trustworthy and usable.”