Image: Princeton AI
That a humble teenage chatbot named Eugene "passed" the Turing Test for the First Time in History made for ideal headline fodder on a relatively quiet Monday morning: The Washington Post termed it a "landmark." PC World called it a "milestone." Some other humans, including myself, call it bullshit.
That may be a bit harsh, but seeing as how the Turing Test has been anointed in our popular mythology as "The Moment Machines Become Smarter Than Humans," a stab at a blunt appraisal is probably in order. The technological achievement may be worth celebrating, but it's far from a true landmark: It's another predictable step towards artificial intelligence that might not even be all that intelligent. Just spend five minutes with Eugene, and you'll see for yourself.
'Eugene Goostman' is a chatbot that was developed in Saint Petersburg, Russia, in 2001 by Vladimir Veselov of Princeton AI and the Ukrainian-born Eugene Demchenko. The program is designed to emulate the language and conversation patterns of a Ukrainian teenager who is learning English.
This weekend, at the Royal Society in London, Eugene was subjected to the modern incarnation of the Turing Test. It asks Alan Turing's seminal 1950 question—can machines think?—by pitting thirty judges against Eugene, other AIs, and real humans in a speed-dating-like series of five-minute conversations in which the judges don't know who is behind the curtain.
When it answered questions alongside a real person, 33 percent of the evaluators apparently mistook Eugene for the human—thus fulfilling the requirements for an artificial intelligence to "pass" the test. Some didn't expect this to happen for years. In 2002, Mitch Kapor placed a $20,000 bet against Ray Kurzweil: "By 2029," Kapor wagered, "no computer -- or 'machine intelligence' -- will have passed the Turing Test."
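For the record, the arithmetic behind the "pass" is simple. Here's a minimal sketch, assuming the figures reported from the event (thirty judges, ten of whom mistook Eugene for the human) and the competition's 30 percent threshold:

```python
# Toy check of the competition's pass criterion: a bot "passes" if it
# fools more than 30 percent of the judges. The figures here are the
# ones reported from the Royal Society event: 30 judges, 10 fooled.
JUDGES = 30
FOOLED = 10
THRESHOLD = 0.30

rate = FOOLED / JUDGES
print(f"fooling rate: {rate:.1%}")  # prints "fooling rate: 33.3%"
print("passed" if rate > THRESHOLD else "failed")  # prints "passed"
```

Ten judges out of thirty clears the bar by a single judge; nine would not have.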
It's still unclear if Kurzweil will be $20,000 richer. There are a lot of questions about Eugene's validity, however, as well as persistent problems with the formulation of the test itself—and larger questions about whether we're even asking the right questions about how smart AI really is, or should be, in the first place.
Still, I thought the easiest way to wade into the AI tangle was to chat with Eugene myself. He's online at Princeton AI's website, though he's been moving slowly ever since the big news broke last night. Even at his best, let's just say that Eugene made for a poor conversation partner.
Note that the field clears when you hit reply; I've re-pasted my questions, in their exact wording, into the field for ease of viewing.
I started out asking Eugene some fairly simple questions. If I were a judge, I think it would have taken me thirty seconds to out Eugene as a spam bot. I asked him questions that were extremely straightforward, and he failed:
I asked him questions that previous Turing Test judges have asked, and he failed:
I then posed a question inspired by what may be the most famous cinematic representation of the Turing Test: that ultimate Voight-Kampff question from Blade Runner. The full question is, of course, "Describe in single words only the good things that come into your mind about... your mother."
I figured I should avoid the complex syntax and go easy on poor Eugene, lest I cause any Replicant-style fallout.
Still, he failed:
I also had plenty of more straightforward "conversations" with the bot, letting him guide us towards friendlier topics. But he didn't want to talk about politics (he's a teenager, so fair enough). He kept asking me about my profession. He insisted he had heard of Miley Cyrus, but referred back to her as 'Cyrus' and didn't want to talk about her. He couldn't describe his favorite singer's music beyond saying he didn't like Britney Spears, and nothing confused him quite like questions about outdoor recreation.
His stock of deflections ran out quickly.
Now, the version of Eugene that's online may not be the latest updated version of the bot. (I've sent questions to Dr. Veselov through the event organizers and will update when I hear back.) Amazingly, Eugene won the Turing competition back in 2012, too, when he fooled 29 percent of the judges, which was one percentage point short of actually 'passing' the test.
At no point did I ever feel like I was having a conversation with anything resembling a human being. Sure, I knew that Eugene wasn't a human from the start, so I was biased, and, unlike the judges, I regarded the shortcomings as programmatic flaws, not potential language barrier issues.
Therein lies another point of controversy about Eugene—his developers intentionally designed him as a teenager whose first language would be foreign to most of the judges, to increase his chances of "passing." It's a clever bit of built-in trickery, and most people don't think it's against the rules, necessarily, but then again, why would you want to game the Turing Test like that? And what, exactly, does it prove that the Turing Test can successfully be gamed?
The Turing Test Is Kind of a Joke Too
This is as opportune a moment as any to point out that the Turing Test has become highly arbitrary, and the significance of any program "passing" it is, at best, ambiguous.
In fact, the test has been evolving over the years, taking on parameters and benchmarks that never appeared in its original conception. Initially the test was outlined by computing godfather Alan Turing as "The Imitation Game" in a 1950 paper. As Professor Murray Shanahan, of the Department of Computing at Imperial College London, explains to Kelly Oakes, there are a number of eyebrow-raising discrepancies between the contest Eugene just passed and the one imagined by its founder.
First, there was never any mention of passing the test with a 30 percent fooling rate—Turing said the mark of success would be when "the interrogator decide[s] wrongly as often when the game is played [between a computer and a human] as he does when the game is played between a man and a woman," Shanahan said.
More importantly, the five-minute time limit was never included in Turing's paper; it was likely designed to prevent the AI competitions from becoming unbearable slogs. Given an extra 10 minutes with Eugene, it's unlikely a single judge would ever have been fooled. And clearly, the parameters of 'judging' are fairly arbitrary in the first place. Really, the entire test is more of a thought experiment than an all-important benchmark about achieving "strong AI"—it just so happens to have been written into tech lore as such.
Has anybody checked whether the reason the Turing Test has been "beaten" is because the humans got dumber?— Anil Dash (@anildash) June 9, 2014
Still, there is certainly something very interesting about the fact that a computer tricked at least 10 judges into thinking it was human, even for five minutes. Our programs are becoming smart enough to convince us that they're foreign-exchange students for a bit, and the bots are beginning to grasp the basic contours of our language.
But even that doesn't necessarily portend any nascent dawn of artificial intelligence. In a great recent essay, Fredrik deBoer, a rhetoric scholar at Purdue University, explained why he's a "crabby patty about AI and cognitive science."
Essentially, there's a huge, unanswered question about AI lurking in all the ubiquitous discussion about it—that is, what exactly do we want it to be? Do we want our AI to trick and delight us in chatrooms, mining statistics and data in order to produce human-like behavior? Or do we want something that actually "thinks" like a person?
As deBoer writes, "What we need to find out—and what we have made staggeringly little progress in finding out—is how the human brain receives information, how it interprets information, how it stores information, and how it retrieves information. I would consider those minimal tasks for cognitive science, and if the purpose of AI is to approximate human cognitive function, necessary prerequisites for achieving it."
Then he describes Eugene pretty nicely:
"In contrast, you have the Google/Big Data/Bayesian alternative. This is a probabilistic model where human cognitive functions are not understood and then replicated in terms of inputs and outputs, but are rather approximated through massive statistical models, usually involving naive Bayesian classifiers. This is the model through which essentially every recommendation engine, translation service, natural language processing, and similar recent technologies works."
When these tactics are applied to the project of human-emulation, you get something that looks like Eugene. In other words, the program is not really artificially "intelligent," per se—it's simply the best-programmed human chameleon yet. That may certainly be an achievement, but it's not necessarily AI.
That's part of the reason that some outlets like Buzzfeed and Techdirt are right to poke a hole in the hullabaloo. It's also, somewhat paradoxically, why so many other outlets are entirely justified in their excitement about the event—it is exhilarating that our computers are getting so good. And we need these milestones, I think, to continue to get excited about the development of information technologies; to keep us looped into the current state of AI, to accelerate funding for daring AI research at a time when much of it is locked up in product development labs.
And because we humans need to start thinking a bit more carefully about what, exactly, AI even is. That's certainly a question no bot can answer, not yet at least.