What Do Machines Hear When They Listen to Music?

We're closer to finding out.

Jul 14 2016, 10:00am

Image: Flickr/Sebastian Dooris

The hot new trend in self-learning algorithms (a technology that's embedded in our phones, our social networks, and more) is trying to figure out how the hell they work.

The thing is that the algorithms known as neural networks are essentially black boxes. We've developed the high-level concepts that govern them and designed the networks themselves, but picking apart decisions that they make on their own is intensely difficult due to their internal complexity. As impressive as these systems are, however, they're not perfect, and to make them better we need to understand what makes them tick.

The latest attempt at tearing the top off of a computational black box was published to the arXiv preprint server this week by researchers at Queen Mary University of London in the UK. They took a peek at how a neural network understands music genres. The results were highly interesting (if a bit inconclusive). For example, we now know that machines pay attention to drums before they listen for pianos and voices.

This process helps researchers to understand "how we should design the system for a certain task," lead researcher and PhD student Keunwoo Choi wrote me in an email. "And of course it's always good to know what's going on there, whether or not it is useful now or in the future," he added.

Work on figuring out how deep learning works has been ongoing since at least 2009, when a team of researchers proposed a method for generating images specifically designed to activate individual "neurons." The idea is that by working backwards from images that networks find appealing, we can find out why. In 2014, researchers designed a way to actually visualize the network's layers themselves, to find out what activates them.
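For the curious, here is a toy sketch of that "working backwards" idea. It is not the researchers' code: the single tanh "neuron," its weights, and the function name are all made up for illustration. Instead of adjusting the network, it nudges the input, step by step, toward whatever makes the neuron fire hardest.

```python
import numpy as np

def maximize_activation(weights, steps=500, lr=0.1, seed=0):
    """Gradient ascent on the INPUT of one tanh 'neuron' (hypothetical toy setup)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=weights.shape)
    x /= np.linalg.norm(x)  # start from a random unit-length input
    for _ in range(steps):
        pre = weights @ x
        # Derivative of tanh(weights . x) with respect to x
        grad = (1.0 - np.tanh(pre) ** 2) * weights
        x = x + lr * grad
        x /= np.linalg.norm(x)  # keep the input at unit length
    return x

# Made-up weights standing in for one learned neuron
weights = np.array([0.0, 1.0, -1.0, 0.5])
best_input = maximize_activation(weights)

# How closely the discovered input lines up with the neuron's weights
cosine = float(best_input @ weights / np.linalg.norm(weights))
print(cosine)
```

In this toy case the input that excites the neuron most ends up pointing in the same direction as its weights, which is the sense in which the preferred input reveals what the neuron is "looking for."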

To date, the most successful attempts have involved tearing neural networks apart, layer by layer of digital "neurons," to find out what they're learning to pay attention to in a stream of information. For neural networks that process images, the idea has been to feed new images into the AI and then visualize what each layer detects in the image, from vague outlines at the lowest layer to fully formed images at the top output layer.

This is like that, but for music.
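To make the layer-by-layer idea concrete, here is a minimal sketch that assumes a tiny two-layer network with random, untrained weights and a random 16-sample signal standing in for audio. None of it comes from the paper. It simply records what each layer outputs so the layers can be inspected one at a time.

```python
import numpy as np

def forward_with_activations(x, layers):
    """Run x through each (weights, bias) layer and keep every layer's output."""
    activations = []
    for w, b in layers:
        x = np.maximum(0.0, w @ x + b)  # ReLU nonlinearity
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 16 inputs -> 8 units -> 4 units
layers = [
    (rng.normal(size=(8, 16)), np.zeros(8)),
    (rng.normal(size=(4, 8)), np.zeros(4)),
]
signal = rng.normal(size=16)  # stand-in for a short snippet of audio

acts = forward_with_activations(signal, layers)
# Index of the most strongly activated unit in each layer
most_active = [int(np.argmax(a)) for a in acts]
print(most_active)
```

With a trained network and real music in place of the random numbers, comparing which units light up for which sounds is the kind of inspection described here.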

The authors write that, at the base layer, the network appeared to extract the most extreme element in the music: percussion. The second layer paid attention to basic harmonic components of the music, particularly the bass notes. The third layer then made distinctions between different instruments, and was highly activated by voice and piano but not by hi-hats, for example.

Think about that. That's pretty cool. Basically, we now know that when machines are listening to music, they "hear" the drums first, then some harmonies, and then pay attention to which instruments are being used.

But there's still a long way to go in this regard.

"As the layers go deeper, it was hard for me to understand," Choi wrote me. "It's like—it is relatively easy to analyse what's going on in our eardrums and ear canals compared to analysing the cochlear, which is easier than analysing the brain."

Choi and his colleagues' paper won't be published in any journals: It failed the peer review process at the 2016 Machine Learning for Signal Processing conference because reviewers felt it was not a big enough discovery to be published. The reviewer's notes, which Choi published online, say that "this would make a fantastic blog post, but I don't see the novelty, analysis, or conclusions that warrant a scientific publication."

And here's the end of the blog post.