Thursday morning, while interviewing 19-year-old Elman Mansimov, I had one of those moments. I caught my reflection in my MacBook’s screen and thought, “What the hell am I doing with my life?”
Mansimov is an artificial intelligence wunderkind who just graduated with a degree in computer science from the University of Toronto, which he started at the age of 15. He’s publishing his first paper this year. It’s an impressive enough feat on its own, but the contents of his paper are what really triggered my mini existential crisis.
Mansimov and his older colleagues designed an AI system that can generate all kinds of trippy images, like a stop sign flying in a blue sky, based on captions written in natural language. In other words, you tell the computer to draw something, even something totally ridiculous that doesn’t really exist, and then it does.
The system relies on a neural network—computational nodes meant to mimic the activity of neurons in the human brain, and an important part of many machine learning efforts. Many neural networks take images as inputs and classify them with text, but Mansimov and his colleagues took text as an input and generated images instead.
The setup is kind of like a neural network sandwich. The first network, the foundation, analyzes text and comes up with a “mental image,” or representation, if you will, of what the words mean. The middle layer of the sandwich, a network for generating images, takes that representation and attempts to recreate an existing image in a training set of annotated images, layer by layer.
At the same time, the middle network introduces noise based on the probability of ending up with the image the researchers asked for. “When you imagine a yellow car,” Mansimov explained, “it could be a Mercedes or a BMW, so by adding the noise, it tries to add the distribution of all the possibilities there may be.”
Screengrab: Mansimov et al.
The final layer of the sandwich is a network that sharpens the resulting blurry images. “The model isn’t sophisticated enough to reconstruct sharp samples, and it’s uncertain about what it’s generating, so it replies with uncertainty and blurry samples,” Mansimov said. “It’s like when you’re uncertain what to say, you more or less start mumbling.”
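For the curious, the three-layer “sandwich” described above can be sketched as a toy pipeline. Everything here is a stand-in for the trained networks in the actual paper: the caption encoder is just an average of made-up word embeddings, the weights are random rather than learned, and the sharpening step is a simple unsharp mask. The structure, though, follows the article: text in, noisy latent sample in the middle, sharpening at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word embeddings standing in for a trained language model.
EMB_DIM, LATENT_DIM, IMG_SIDE = 8, 4, 6
vocab = {w: rng.standard_normal(EMB_DIM)
         for w in ["a", "stop", "sign", "in", "blue", "sky"]}

def encode_caption(words):
    """Stage 1: map the caption to a 'mental image' vector
    (here a mean of word embeddings; the real model is trained)."""
    return np.mean([vocab[w] for w in words], axis=0)

# Random projection weights standing in for trained parameters.
W_mu = rng.standard_normal((LATENT_DIM, EMB_DIM))
W_logsig = rng.standard_normal((LATENT_DIM, EMB_DIM)) * 0.1
W_out = rng.standard_normal((IMG_SIDE * IMG_SIDE, LATENT_DIM))

def generate(caption_vec):
    """Stage 2: sample a latent code around the caption representation.
    The added noise spreads samples over the many images one caption
    allows (a yellow car could be a Mercedes or a BMW)."""
    mu = W_mu @ caption_vec
    sigma = np.exp(W_logsig @ caption_vec)  # positive std-devs
    z = mu + sigma * rng.standard_normal(LATENT_DIM)
    return (W_out @ z).reshape(IMG_SIDE, IMG_SIDE)

def sharpen(img):
    """Stage 3: post-process the blurry sample; here an unsharp mask."""
    blurred = (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
               np.roll(img, 1, 1) + np.roll(img, -1, 1)) / 4
    return img + (img - blurred)

caption = ["a", "stop", "sign", "in", "blue", "sky"]
img = sharpen(generate(encode_caption(caption)))
print(img.shape)  # prints (6, 6)
```

Because of the noise in stage 2, running `generate` twice on the same caption yields two different images, which is exactly the point: one caption describes a whole distribution of plausible pictures.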
The team, which also included University of Toronto graduate students and professors, first tested their system by challenging it to generate images of buses and other objects in different colours. Although the images were blurry, it worked. When the researchers asked for an image of a plane flying in blue skies, the sky was indeed blue. When they asked for a rainy sky, it became gray.
Photo: Elman Mansimov
But while it’s easy for humans to imagine all kinds of surreal images in our heads based on things we’ve seen before, machines? Not so much. So, to really test the system’s limits, they asked it to generate images of surreal scenes that didn’t exist in the training set: an open toilet sitting in a grassy field, for example.
The AI excelled at this task as well. “That showed the whole purpose of this work: to generalize,” Mansimov told me. “If you want a more realistic AI system, the AI should be able to do that too.”
Screengrab: Mansimov et al.
As for what’s next, Mansimov said they need to work on improving the system so that the images it generates are sharper and more recognizable. In part, this might come from more training. Also, he said, the team will try applying their system to things other than image generation, such as speech.
“As a proxy to see what’s happening with the model, I used images,” Mansimov said, but it could theoretically do the same with the human voice—perhaps, he suggested, to create more natural-sounding voices for use by robots.
If the results of a robot trying to speak using Mansimov and his colleagues’ system sound anything like its images look, we can probably look forward to robots that slur their way through speech like a drunken sailor. But hey, it’s a start.