Why AI Can Still Hardly Pass an Eighth Grade Science Test

The Allen AI Science Challenge shows it takes more than encyclopaedic knowledge to reason your way to a correct answer.

An artificial intelligence competition that asked AI models to answer eighth-grade science questions announced its winners this week—but it doesn't look like robots will be graduating junior high any time soon.

The Allen AI Science Challenge, set up by the Allen Institute for Artificial Intelligence, asked participants to build AI models that could answer multiple-choice questions akin to those found on eighth-grade science exams. The winning AI scored a rather lukewarm 59.3 percent, which is not exactly an A grade.

The task was posed as an alternative to the conventional Turing Test, which is often criticised for being an exercise in duping gullible humans rather than proving any real "intelligence." The point of using an exam-style test was to challenge computer models' language processing and reasoning skills, as they need to be able to first understand the question and then find and apply the correct answer.

But even the highest scoring methods didn't seem to get close to human-style logic or reasoning.

"The final scores, we weren't sure what to expect, and I think people did pretty well but nothing sort of Earth-shattering really either," said Allen Institute software architect Oyvind Tafjord, who is responsible for the final verification of the results.

The first place prize of $50,000 was taken by Chaim Linhart, an Israel-based researcher from startup TaKaDu. He goes by the name "Cardal" on Kaggle, the online platform where the competition was run. Second and third place came in just around one percent behind.

Tafjord explained that all three top teams relied on search-style machine learning models: they essentially found ways to search massive text corpora for the answers. Popular text sources included dumps of Wikipedia, open-source textbooks, and online flashcards intended for studying purposes.

These models have anywhere from 50 to 1,000 different "features" to help solve the problem. A simple feature could look at something like how often a question and answer appear together in the text corpus, or how close together words from the question and answer appear.
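To make the idea of a co-occurrence feature concrete, here is a minimal Python sketch. It is not any competitor's actual system: the four-sentence corpus, the stop-word list, and the function names are all invented for illustration, and a real entry would combine many such features over a far larger corpus. The question and answer options are the periodic-table example quoted later in this article.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "by", "in",
              "and", "which", "what", "used", "do", "about"}

# Invented toy corpus; real entries searched Wikipedia dumps, textbooks, etc.
CORPUS = [
    "Scientists use the periodic table to determine the properties of elements.",
    "A Punnett square predicts the traits of offspring.",
    "A pedigree chart traces a trait through a family.",
    "The rock cycle describes how rocks change over time.",
]

QUESTION = "Which model is used by scientists to determine the properties of elements?"
OPTIONS = {
    "A": "a Punnett square",
    "B": "the Periodic Table",
    "C": "a pedigree chart",
    "D": "the rock cycle",
}


def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def cooccurrence_score(question, answer, corpus):
    """Sum over sentences of (question-word overlap) * (answer-word overlap)."""
    q_words = tokenize(question)
    a_words = tokenize(answer)
    score = 0
    for sentence in corpus:
        s_words = tokenize(sentence)
        # A sentence only contributes if it mentions both question and answer words.
        score += len(q_words & s_words) * len(a_words & s_words)
    return score


def pick_answer(question, options, corpus):
    """Return the label of the option with the highest co-occurrence score."""
    return max(options, key=lambda label: cooccurrence_score(question, options[label], corpus))


print(pick_answer(QUESTION, OPTIONS, CORPUS))  # prints "B" for this toy corpus
```

A feature like this does well when a single sentence in the corpus effectively states the answer. It has nothing to offer when the answer has to be inferred rather than looked up.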

A graph of competitors' top scores shows steady improvement as the competition went on. Tafjord explained that this slow climb was the result of participants trying out new features and re-running their models before reaching their final system.

No one pushed past the 60 percent mark.

"It was unclear what to expect really," said Tafjord. "I was actually quite impressed that they managed to take the information-retrieval, search type of methods and push them quite that far. I would maybe have expected that to go more like to mid-50 percent."

He believes that a different approach is needed to move the scores much higher.

"When you look at these questions, there are certain classes of questions that are clearly sort of definitions, things that if you find the right sentence in a textbook then you basically have the answer," he said. "While other questions require some sort of deeper reasoning, or some kind of picture of the world that you can actually do a little bit of reasoning with. And for that, I think you have to move gradually from this pure search paradigm to something that can take more structured knowledge into account."

For example, in the competition, all of the top 10 models got the following question correct:

Which model is used by scientists to determine the properties of elements?

(A) a Punnett square
(B) the Periodic Table
(C) a pedigree chart
(D) the rock cycle

But few of the top 10 models managed this question:

What do earthquakes tell scientists about the history of the planet?

(A) Earth's climate is constantly changing.
(B) The continents of Earth are continually moving.
(C) Dinosaurs became extinct about 65 million years ago.
(D) The oceans are much deeper today than millions of years ago.

(The answer to the first question is B. The answer to the second is also B, but most models chose C.)

The question of what counts as true artificial "intelligence" remains. But just as a chatbot tricking a human into believing it's a teenager with poor English skills doesn't seem like a very satisfactory Turing Test win, looking up definitions on Wikipedia lacks a certain sense of true knowledge, even if it ticks the boxes, or 60 percent of them.

Nevertheless, the competition achieved its goal of engaging the AI community in the task, with 780 teams participating. "Probably much more interesting will be what that can spawn in the longer term," said Tafjord. "Hopefully more people are thinking about, 'Yeah, this is actually quite challenging, and how can we make computers take this next step and be more helpful to us?'"