Matrix multiplication and AI: A short primer
Last summer, I spent a week at a conference dedicated to graphics-processing units—GPUs. It was presented by GPU big name Nvidia, a brand that is largely associated with gaming hardware. At the conference, however, gaming was a sideshow. For that matter, graphics themselves (excluding VR) were a sideshow, despite being in the actual name. In general, this was a machine learning conference, and, to most of the attendees, of course it was.
With chipmaker AMD’s announcement this week at CES that the bleeding edge of its GPU product line will be targeted at machine learning, at least initially, I thought it would be a good opportunity to take a step back and offer a bit of background on why GPUs and machine learning are so intimately connected in the first place. It has to do with matrices.
First, understand that there’s no magic to machine learning. It’s just math. And in the grand scheme of math, the basic ideas behind machine learning are even kind of simple, at least conceptually. Machine learning is optimization. Given some very long equation with a lot of variables in it, can we come up with a good/reliable way of tweaking those variables such that our very long equation spits out accurate predictions? While this may be a conceptually simple question to ask, actually computing the specific tweaks needed is labor-intensive.
To get some intuition about this sort of optimization, start just by thinking of cause and effect. The air outside is cold. Why? We might look at things like where the jet stream is; what the air pressure is; whether it’s cloudy or sunny out; how much moisture is in the air; and/or what season it is. I’m no meteorologist, but those seem like things that might reasonably predict the air temperature outside, so if, say, we didn’t know the air temperature ahead of time, but we knew all of this other stuff, we might be able to predict the temperature reliably.
Of course, not all of those things are of equal importance when it comes to predicting the air temperature. For example, what season it is might be 10 times as important as anything else, while air moisture might matter only a third as much as elevation. The point is that we can take a bunch of observations and then assign each of them a weight (or emphasis) indicating how important that observation is compared to the others. Then we can take some new observations, plug them into the optimized, weighted equation, and make a solid prediction of how cold it is.
The weighted equation is what’s normally called a model. It models relationships that exist in the world and so it has predictive utility. The hard math is in how we come up with the model, or how we figure out how important each of those different observations is relative to the other ones.
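A minimal sketch of such a weighted equation in Python. The observation names, their numeric encodings, and the weight values are all made up for illustration; a real model would learn its weights from data.

```python
# Hypothetical weights: how much each observation counts toward the prediction.
# The numbers are invented for illustration, not learned from real data.
weights = {
    "season": 10.0,
    "pressure": 1.0,
    "cloud_cover": -2.0,
    "moisture": 0.5,
}

def predict_temperature(observations, weights):
    """Return the weighted sum of the observations (the model's prediction)."""
    return sum(weights[name] * value for name, value in observations.items())

# One made-up set of observations, already encoded as numbers.
today = {"season": 1.5, "pressure": 3.0, "cloud_cover": 0.5, "moisture": 4.0}
prediction = predict_temperature(today, weights)
```

Changing any one weight changes how strongly that observation pulls on the prediction, which is all "importance" means here.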
We do this by taking a lot of observations and doing a lot of optimizations one after another. Each one would then look something like the following:
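One way to write such an equation, assuming a plain weighted sum over the observations listed earlier, with the hypothetical symbols w_1 through w_5 standing for the unknown weights:

```latex
w_1 \cdot \text{jet stream} + w_2 \cdot \text{pressure} + w_3 \cdot \text{cloud cover} + w_4 \cdot \text{moisture} + w_5 \cdot \text{season} = \text{temperature}
```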
Plug actual observations into the equation above, and we can come up with values for the weights that bring the left-hand side closest to the actual temperature on the right-hand side.
For the resulting weights to be meaningful, we have to do this a lot, with a lot of observations. Training a real-life machine learning model might involve doing this same thing millions of times, with each iteration tweaking those weights just a little bit to better optimize the resulting model.
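The repeated small tweaks can be sketched as gradient descent on a squared error, which is one common way to do this kind of optimization; the method is an assumption for the sake of the example. The data, the learning rate, and the function name `train` are likewise made up for illustration.

```python
def train(rows, targets, steps=5000, learning_rate=0.01):
    """Fit weights so that sum(w * x) approximates each target."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        for row, target in zip(rows, targets):
            prediction = sum(w * x for w, x in zip(weights, row))
            error = prediction - target
            # Nudge each weight a little in the direction that shrinks the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, row)]
    return weights

# Made-up observations generated from the "true" weights 2.0 and -1.0.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.5, 1.5]]
targets = [2.0 * a - 1.0 * b for a, b in rows]

learned = train(rows, targets)
```

With this toy data the learned weights settle very close to 2.0 and -1.0. Real training works on far more rows and far more weights, which is where the cost comes from.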
OK, that plus some calculus is the gist of machine learning’s central mechanic. In practice, we’re not actually stepping through each observation one by one. Instead, we’re doing matrix math. Our input observations, millions of rows of them, can actually be viewed as just a big old matrix of numbers where the rows are individual sets of observations and the columns are individual observables (pressure, moisture, season, etc.). And then we can make a separate matrix out of the temperature observations (the equation’s right-hand side). What we wind up doing to solve for the unknown weights is multiplying matrices, and, if we’re building a neural network, we might be doing a lot of these computationally expensive operations.
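To make the matrix view concrete, here is a small sketch in plain Python: each row of a made-up observation matrix is one set of observations, and multiplying that matrix by a column of weights yields one prediction per row. The numbers and the helper name `matvec` are illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector, giving one value per row."""
    return [sum(x * w for x, w in zip(row, vector)) for row in matrix]

# Rows are sets of observations; columns are observables (made-up numbers).
observations = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
]
weights = [10.0, 1.0, 0.5]  # hypothetical weights, one per column

predictions = matvec(observations, weights)
```

Training compares predictions like these against the separate matrix of observed temperatures and adjusts the weights accordingly.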
Obviously, all of this is a big simplification, but the thing to understand is that what we wind up doing in machine learning is crunching together big matrices of numbers. The same thing happens in graphics processing, where the matrices instead represent pixels. Computing graphics is all about doing computations across big matrices of pixel data, updating each one. That’s what GPUs exist for: doing computations across big matrices. Massive parallelization.
How is this different from normal computing? The key thing is parallel computation. Generally, in a CPU, we imagine things happening sequentially. This makes sense for computations that depend on each other, where one computation has to wait for another to complete because it depends on the result of the earlier computation. Given this sort of computing, adding more and more cores doesn’t wind up adding all that much computing power, but a GPU can wield hundreds of individual cores and, on work that splits up cleanly, wind up many times faster.
Basically, multiplying matrices means pairing each row of one matrix with each column of the other, multiplying the matching entries, and then adding up the results. Each of those row-and-column pairings is independent of the others, so this is something that can be done in parallel. Meanwhile, it’s something that’s done very poorly in a conventional computing architecture. In that case, we just wind up with computations standing in line for one of a scant few processor cores.
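A sketch of that row-by-column rule in plain Python. Each output cell depends only on one row of the first matrix and one column of the second, so the cells can be computed in any order, which is what lets a GPU hand them to many cores at once. Here the order is shuffled purely to show that independence; nothing actually runs in parallel, and the function names are made up.

```python
import random

def cell(a, b, i, j):
    """Entry (i, j) of a*b: row i of a times column j of b, summed."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))

def matmul_any_order(a, b, seed=0):
    """Multiply a by b, filling in the output cells in a shuffled order."""
    rows, cols = len(a), len(b[0])
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    random.Random(seed).shuffle(positions)
    result = [[0] * cols for _ in range(rows)]
    for i, j in positions:
        result[i][j] = cell(a, b, i, j)
    return result

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_any_order(a, b)
```

Whatever seed is used, the product comes out the same, because no cell ever waits on another.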
We can make machine learning algorithms work faster simply by adding more and more processor cores within a GPU. That tends to be an easier engineering problem than those faced by conventional CPUs, where parallelism can help with performance sometimes, but finding and implementing that utility is pretty hard. That’s why GPUs are so important to machine learning, and, increasingly, vice versa.