The Plan to Replace the Turing Test with a ‘Turing Olympics’

When is a robot as intelligent as a human? When it can build flatpack furniture.

Victoria Turk

Image: I K O/Flickr

The Turing Test is no longer fit for purpose. Indeed, it perhaps never was—when Alan Turing introduced the idea in a 1950 paper, he suggested it as a philosophical exercise to explore the question, "Can machines think?" rather than any practical assessment of artificial intelligence.

But it's since become known as a benchmark for bots wanting to prove their smarts. The concept is simple: If a computer program manages to sail through unrecognised when pitted against a human in a text conversation, it passes the test.

The problem is, that's not really a true marker of intelligence. Last year, a relatively crude chatbot pretending to be a 13-year-old Ukrainian boy managed to pass it.

At the weekend, a group of AI experts met to discuss how to move past the Turing Test in a workshop at the AAAI Conference on Artificial Intelligence in Austin, Texas. They set out with the aim of crafting a "Turing Championship": a new set of challenges that could help push AI research.

Gary Marcus, a professor of psychology at NYU and a chair of the workshop, told me there was general agreement that the Turing Test had reached its expiry date. "Basically what I argued is that it's really an exercise in deception and evasion," he said.

As demonstrated by the Eugene Goostman chatbot that claimed a victory last year, it's relatively easy for a bot to game the system—by pretending English isn't its first language, for instance, or simply making use of admittedly human-like ploys such as changing the subject. And at the end of the day it only has to fool a human judge into thinking it's as believable as one other human, which isn't the most difficult heist to pull.

The idea of the workshop group is to develop not just one test but a series of challenges that could assess different kinds of intelligence, beyond chatbot-level communication skills. "There are many things that AI has progressed lately, and we want to challenge those as well—like vision, speech recognition, natural language processing, and so on," said Francesca Rossi, a professor of computer science at the University of Padua in Italy, who spoke to me over Skype alongside Manuela Veloso, professor of computer science and robotics at Carnegie Mellon University.

The new set of challenges would be less a Turing Test, more a Turing Olympics.

One challenge the group is considering is what Marcus referred to as the "Ikea test." It sounds like the punchline to a joke: How do you know when a machine is more intelligent than a human? When it can follow the instructions to build flatpack furniture. But that's pretty much the idea.

The robot would have to be capable of seeing the parts, interpreting the instructions, and, eventually, have the motor skills to put it together. I asked Marcus if this task would, then, require an actual physical robot. "Well, a physical robot guided by an AI program," he said, and added that the event could have different tracks and would likely start with simulations before moving to robots—"So to really be the ultimate winner you'd have to do it with an actual robot, actual objects." Veloso added that they could consider human collaboration as an element to this test.

Another proposed event is the Winograd Schema, a language-based test that requires something like human common sense. This was proposed by computer scientist Hector Levesque in a 2011 paper, and last year US software company Nuance announced sponsorship of an annual Winograd Schema Challenge.

The Winograd Schema gives participants a sentence, then asks a simple question about that sentence. Levesque gives an example:

The trophy doesn't fit in the brown suitcase because it's too big. What is too big? (The trophy or the suitcase?)

These questions are really easy for humans, but actually require a relatively deep understanding of language—Marcus described it as "sentences which you can't really understand unless you understand the world." Crucially, they're not Googleable.

A third test was suggested by Marcus himself in a piece for the New Yorker. He proposed asking a computer program to watch a video it hadn't seen before and answer a question about it, like "Why did Russia invade Crimea?" or "Why did Walter White consider taking a hit out on Jesse?" He told me that Fei-Fei Li, director of the Stanford AI Lab, had a similar idea using images, and that they had decided to join forces to create an event where a machine would have to answer "journalist-type" questions about images, video, or audio.

Again, these skills come easily to humans, so long as they're paying attention, but taking input in one format and responding to questions about it in another requires some real comprehension. You can't just bluff it like a chatbot and hope no one notices you changing the subject.

Other ideas that came out of the workshop included challenging an AI to play a new videogame as well as a 12-year-old child, or tasking a digital teacher with learning a new topic and teaching it as well as (or better than) a human.

The group hopes to launch the first championship next year after a second workshop session at the IJCAI conference in Buenos Aires in July. They plan to start with three or four tests, and add more as appropriate in later years.

"We don't envision the same computer program doing the four things at this time, the first year," Veloso said. At the moment, completing even one of the tasks is an intentionally difficult prospect—after all, the whole point is to push AI research further.

"Even for each category maybe we can envision challenges that are increasingly more difficult," said Rossi. With the video test, you could start with a restricted selection of videos and gradually make it broader.

Marcus suggested adjusting the goalposts of what it means to pass the tests. Does a computer have to match a child, an average person, or an expert in the field in order to be considered a winner? These levels could be adjusted to monitor continued progress. "It's certainly possible there'll be superhuman performance," said Marcus. "We could easily imagine on the Ikea challenge that some robot could be a lot better than people are at putting together Ikea things." Granted, that's not necessarily the highest bar.

The main challenge for the organisers now is defining the rules of the challenges. Rossi and Veloso said they will likely unite the technical community to propose initial rules, then give prospective participants and other interested parties a chance to comment before finalising them. As a former president of the International RoboCup Federation—an annual soccer World Cup for robots—Veloso has experience in this area.

While everyone hopes to see an artificially intelligent bot ace the tests at some point, that's really just an incentive along the way to another goal. "The point of the test is not really to pass the test," said Rossi. "The final goal is to advance AI; to make machines more intelligent."

Maybe one day we'll have a bot that's capable of building your TV stand, recapping the episodes you missed, and playing a few word games just for fun. At which point we might want to start thinking about how far we really want to incentivise advancement in AI.