Carnegie Mellon University

machine learning

September 04, 2019

Rapid AI Growth Sometimes Happens at the Expense of Scientific Scrutiny, Tepper School of Business Research Shows

Noelle Wiker

New Paper Calls for More Rigorous Testing of Datasets, Results

White-hot growth in the field of machine learning has sometimes outpaced the scrutiny traditionally associated with scientific inquiry, leading to outsized claims about the progress that has been achieved in the field, cautions a new paper from the Tepper School of Business.

The paper, which focuses on reading comprehension functions in artificial intelligence, underscores the need for greater scrutiny of claims made in machine learning research, explains Zachary Chase Lipton, Assistant Professor of Operations Research and Machine Learning, who coauthored the paper with Divyansh Kaushik of Carnegie Mellon’s Language Technologies Institute.

Lipton and Kaushik examined five datasets created for passage-based question-and-answer functions: bAbI, SQuAD, CBT, CNN, and Who-did-What. They tested the datasets rigorously, checking, for example, how well a model performed if it couldn’t see the question that was posed, if the passage it was scanning was gibberish, or if it only had to look at the last sentence to find the answer.
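The kinds of checks described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the word-shuffling used to produce "gibberish," the toy examples, and the stand-in model are all assumptions made for illustration only.

```python
import random
from typing import Callable, Dict, List

# Hypothetical example format: a passage, a question, and the expected answer.
Example = Dict[str, str]


def hide_question(ex: Example) -> Example:
    """Blank the question so the model sees only the passage."""
    return {**ex, "question": ""}


def scramble_passage(ex: Example, seed: int = 0) -> Example:
    """Shuffle the passage's words (one assumed way to make it gibberish)."""
    words = ex["passage"].split()
    random.Random(seed).shuffle(words)
    return {**ex, "passage": " ".join(words)}


def last_sentence_only(ex: Example) -> Example:
    """Keep only the final sentence (naive split on periods)."""
    sentences = [s.strip() for s in ex["passage"].split(".") if s.strip()]
    return {**ex, "passage": sentences[-1] + "." if sentences else ""}


def accuracy(model: Callable[[str, str], str], data: List[Example]) -> float:
    """Fraction of examples the model answers exactly right."""
    correct = sum(model(ex["passage"], ex["question"]) == ex["answer"] for ex in data)
    return correct / len(data)


def toy_model(passage: str, question: str) -> str:
    """Stand-in model that ignores the question and returns the passage's last word."""
    words = passage.rstrip(".").split()
    return words[-1] if words else ""


# Made-up examples in the style of a passage-based question-answering dataset.
data = [
    {"passage": "Mary went to the kitchen. John moved to the garden.",
     "question": "Where is John?", "answer": "garden"},
    {"passage": "Sam picked up the ball. Sam walked to the office.",
     "question": "Where is the ball?", "answer": "office"},
]

checks = [
    ("full input", lambda ex: ex),
    ("no question", hide_question),
    ("scrambled passage", scramble_passage),
    ("last sentence only", last_sentence_only),
]

for name, ablate in checks:
    score = accuracy(toy_model, [ablate(ex) for ex in data])
    print(f"{name}: {score:.2f}")
```

If a model scores about as well with the question hidden or with only the last sentence as it does with the full input, that suggests the dataset can be solved without much actual reading, which is the kind of redundancy the researchers describe.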

“What we found was a little bit unsettling,” says Lipton. “At one time, these models were said to be state of the art, but it turned out there was some amount of redundancy in the dataset or predictability in the way the dataset was constructed; it was sort of obvious what the question would be asking.”

Therefore, the claims about advancing AI might have been somewhat inflated, he explains.

“The field’s exploding in a way that allows a lot of people to get ahead of themselves,” says Lipton. “In the haste to explore new territory, there were some oversights of precisely what problem we’re solving in the first place, and of elbow-grease experiments to verify results. Our paper has served that role of going back and saying, ‘Let’s take a deep breath, and ask those questions and get some definitive answers.’”

The research, “How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks,” won best short paper out of approximately 1,000 entries at the 2018 Conference on Empirical Methods in Natural Language Processing in Brussels, Belgium. 

Lipton notes that while machine learning is a field that moves quickly and sometimes gets ahead of itself, it is also receptive to constructive criticism, as reflected in the conference award.

“I think that’s a pretty healthy sign,” he says.