
Main point

Full script

Suppose I give you two different lists of numbers, or maybe two different functions, and I ask you to think of all the ways you might combine those two lists to get a new list of numbers, or combine the two functions to get a new function. Maybe one simple way that comes to mind is to simply add them together term by term. Likewise with the functions, you can add all the corresponding outputs. In a similar vein, you could also multiply the two lists term by term and do the same thing with the functions. But there's another kind of combination, just as fundamental as both of those, but a lot
less commonly discussed, known as a convolution. But unlike the previous two cases, it's not something that's merely inherited from an operation you can do to numbers. It's something genuinely new for the context of lists of numbers or combining functions. Convolutions show up all over the place: they are ubiquitous in image processing, they're a core construct in the theory of probability, they're used a lot in solving differential equations, and one context where you've almost certainly seen one, if not by this name, is multiplying two polynomials together. For someone in the business of visual explanations, this is an especially great topic, because the formulaic definition in isolation, without context, can look kind of intimidating. But if we take the time to really unpack what it's saying, and before that actually motivate why you would want something like this, it's an incredibly beautiful operation. And I have to admit, I actually learned a little something while putting together the visuals for this project. In the case of convolving two different functions, I was trying to think of different ways you might picture what that means, and with one of them I had a little bit of an aha moment for why it is that normal distributions play the role that they do in probability, and why it's such a natural shape for a function. But I'm getting ahead of myself; there's a lot of setup for that one. In this video, our primary focus is just going to be on the discrete case, and in particular building up to a very unexpected but very clever algorithm for computing these, and
I'll pull out the discussion for the continuous case into a second part. It's very tempting to open up with the image processing examples, since they're visually the most intriguing, but there are a couple bits of finickiness that make the image processing case less representative of convolutions overall. So instead, let's kick things off with probability.


And in particular, one of the simplest examples that I'm sure everyone here has thought about at some point in their life, which is rolling a pair of dice and figuring out the chances of seeing various different sums. And you might say, not a problem, not a problem. Each of your two dice has six different possible outcomes, which gives us a total of 36 distinct
possible pairs of outcomes. And if we just look through them all, we can count up how many pairs have a given sum. Arranging all the pairs in a grid like this, one pretty nice thing is that all of the pairs that have a constant sum are visible along one of these diagonals. So simply counting how many exist on each of those diagonals will tell you how likely you are to see a particular sum. And I'd say, very good, very good.


But can you think of any other ways that you might visualize the same question? Other images that can come to mind to think of all the distinct pairs that have a given sum? And maybe one of you raises your hand and says, 'Yeah, I've got one.' Let's say you picture these two different sets of possibilities each in a row, but you
flip around that second row. That way, all of the different pairs which add up to seven line up vertically like this. And if we slide that bottom row all the way to the right, then the unique pair that adds up to two, the snake eyes, is the only one that aligns. And if I schlunk that over one unit to the right, the pairs which align are the two different pairs that add up to three. And in general, different offset values of this lower array, which, remember, I had to flip around first, reveal all the distinct pairs that have a given sum. As far as probability questions go, this still isn't especially interesting, because all we're
doing is counting how many outcomes there are in each of these categories. But that is with the implicit assumption that there's an equal chance for each of these faces to come up. But what if I told you I have a special set of dice that's not uniform? Maybe the blue die has its own set of numbers describing the probabilities for each face coming up, and the red die has its own unique, distinct set of numbers. In that case, if you wanted to figure out, say, the probability of seeing a two, you would multiply the probability that the blue die is a one times the probability that the red die is a one. And for the chances of seeing a three, you look at the two distinct pairs where that's possible, and again multiply the corresponding probabilities and then add those two products together. Similarly, the chances of seeing a four involves multiplying together three different pairs of possibilities and adding them all together. And in the spirit of setting up some formulas, let's name these top probabilities a1, a2, a3, and so on, and name the bottom ones b1, b2, b3, and so on. And in general, this process, where we're taking two different arrays of numbers, flipping the second one around, and then lining them up at various different offset values, taking a bunch of pairwise products and adding them up, that's one of the fundamental ways to think about what a convolution is. So, just to spell it out a little more exactly: through this process, we just generated probabilities for seeing two, three, four, on and on up to twelve, and we got them by mixing together one list of values, a, and another list of values, b. In the lingo, we'd say the convolution of those two sequences gives us this new sequence, a new sequence of eleven values, each of which looks like some sum of pairwise products. If you prefer, another way you could think about the same operation is to first create a table of all the pairwise products, and then add up along all these diagonals.
Again, that's a way of mixing together these two sequences of numbers to get us a new sequence of eleven numbers. It's the same operation as the sliding windows thought, just another perspective. Putting a little notation to it, here's how you might see it written down: the convolution of a and b, denoted with this little asterisk, is a new list, and the nth element of that list looks like a sum that goes over all different pairs of indices, i and j, such that the sum of those indices is equal to n. It's kind of a mouthful, but for example, if n was six, the pairs we'd be going over are one and five, two and four, three and three, four and two, and five and one: all the different pairs that add up to six. But honestly, however you write it down, the notation is secondary in importance to the visual you might hold in your head for the process.
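As a minimal sketch of that index-pair definition, in plain Python with no libraries (the function name `convolve` is just for illustration):

```python
def convolve(a, b):
    # The n-th output is the sum of a[i] * b[j] over all pairs with i + j == n,
    # which is exactly the flip-and-slide process described above.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Two fair dice: each face has probability 1/6. Entry n of the result is
# the probability that the two faces sum to n + 2.
die = [1 / 6] * 6
sum_distribution = convolve(die, die)
```

Note that the output has eleven entries, matching the possible sums two through twelve, and the middle entry (the chance of rolling a seven) comes out to 6/36.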


Here, maybe it helps to do a super simple example, where I might ask you, what's the convolution of the list 1, 2, 3 with the list 4, 5, 6? You might picture taking both of these lists, flipping around that second one, and then starting with it slid all the way over to the left. Then the pair of values which align are 1 and 4; multiply them together, and that gives us the first term of our output. Slide that bottom array one unit to the right, and the pairs which align are 1 and 5, and 2 and 4. Multiply those pairs and add them together, and that gives us 13, the next entry in our output. Slide things over once more, and we'll take 1×6 + 2×5 + 3×4, which happens to be 28. One more slide and we get 2×6 + 3×5, which gives us 27, and finally the last term looks like 3×6. If you'd like, you can pull up whatever your favorite programming language is and your favorite library that includes various numerical operations, and you can confirm I'm not lying to you: if you take the convolution of 1, 2, 3 against 4, 5, 6, this is indeed the result that you'll get. We've seen one case where this is a natural and desirable operation, adding up two probability distributions, and another common example would be a moving average.
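For instance, with NumPy as the numerical library:

```python
import numpy as np

# The convolution of [1, 2, 3] with [4, 5, 6], matching the slide-and-multiply
# arithmetic above: 1*4, then 1*5 + 2*4, then 1*6 + 2*5 + 3*4, and so on.
result = np.convolve([1, 2, 3], [4, 5, 6])  # array([ 4, 13, 28, 27, 18])
```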


Imagine you have some long list of numbers, and you take another, smaller list of numbers that all add up to one. In this case, I just have a little list of five values, and they're all equal to one fifth. Then if we do this sliding window convolution process, and kind of close our eyes and sweep under the rug what happens at the very beginning of it, once our smaller list of values entirely overlaps with the bigger one, think about what each term in this convolution really means. At each iteration, what you're doing is multiplying each of the values from your data by one fifth and adding them all together, which is to say you're taking an average of your data inside this little window. Overall, the process gives you a smoothed-out version of the original data. And you could modify this, starting with a different little list of numbers, and as long as that little list all adds up to one, you can still interpret it as a moving average. In the example shown here, that moving average would be giving more weight towards the central value. This also results in a smoothed-out version of the data. If you do kind of a two-dimensional analog of this, it gives you a fun algorithm for blurring a given image. And I should say, the animations I'm about to show are modified from something I originally made for part of a set of lectures I did with the Julia Lab at MIT for a certain open courseware
class that included an image processing unit. There we did a little bit more to dive into the code behind all of this, so if you're curious, I'll leave you some links. But focusing back on this blurring example, what's going on is that I've got this little 3×3 grid of values that's marching along our original image, and if we zoom in, each one of those values is one ninth.
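A rough sketch of this averaging in NumPy, for grayscale values and ignoring the edge pixels (the lectures linked above handle color channels and boundaries properly; this is just the core idea):

```python
import numpy as np

def box_blur(image, k=3):
    # Slide a k-by-k grid of 1/k^2 values across the image; each output
    # pixel is the average of the k*k input pixels under the grid.
    kernel = np.full((k, k), 1 / k**2)
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + k, j:j + k])
    return out
```

A single bright pixel bleeds evenly into its whole neighborhood, which is exactly the blurring effect described above.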


And what I'm doing at each iteration is multiplying each of those values by the corresponding pixel that it sits on top of. And of course, in computer science, we think of colors as little vectors of three values representing the red, green, and blue components. When I multiply each of these little one-ninth values by the pixel it sits on top of and add them all together, it gives us an average along each color channel, and the corresponding pixel for the image on the right is defined to be that sum. The overall effect, as we do this for every single pixel on the image, is that each one kind of bleeds into all of its neighbors, which gives us a blurrier version of the original. In the lingo, we'd say that the image on the right is a convolution of our original image
with the little grid of values. Or, more technically, maybe I should say that it's the convolution with a 180-degree rotated version of that little grid of values. Not that it matters when the grid is symmetric, but it's just worth keeping in mind that the definition of a convolution, as inherited from the pure math context, should always invite you to think about flipping around that second array. If we modify this slightly, we can get a much more elegant blurring effect by choosing a different grid of values. In this case, I have a little 5×5 grid, but the distinction is not so much its size.
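One way to build such a grid of values is to sample a bell curve and normalize; here's a sketch, where the `size` and `sigma` parameters are illustrative choices rather than the exact values from the animation:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Sample a 2D bell curve on a size-by-size grid centered at zero,
    # then normalize so the entries sum to 1, which keeps the
    # convolution a weighted average.
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / kernel.sum()
```

The resulting grid is largest in the middle, falls off towards the edges, and is symmetric under a 180-degree rotation, which is why the flipping convention never matters here.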


If we zoom in, we notice that the value in the middle is a lot bigger than the values towards the edges, and where this is coming from is that they're all sampled from a bell curve, known as a Gaussian distribution. That way, when we multiply all of these values by the corresponding pixel that they're on top of, we're giving a lot more weight to that central pixel, and much less towards the ones out at the edge. And just as before, the corresponding pixel on the right is defined to be this sum. As we do this process for every single pixel, it gives a blurring effect which much more authentically simulates the notion of putting your lens out of focus, or something like that. But blurring is far from the only thing that you can do with this idea. For instance, take a look at this little grid of values, which involves some positive numbers on the left and some negative numbers on the right, which I'll color with blue and red, respectively. Take a moment to see if you can predict and understand what effect this will have on the
final image. In this case, I'll just be thinking of the image as grayscale instead of colored, so each of the pixels is represented by one number instead of three. And one thing worth noticing is that as we do this convolution, it's possible to get negative values. For example, at this point here, if we zoom in, the left half of our little grid sits entirely on top of black pixels, which would have a value of zero, but the right half of negative values all sit on top of white pixels, which would have a value of one. So when we multiply corresponding terms and add them together, the result will be very negative. And the way I'm displaying this with the image on the right is to color negative values red and positive values blue. Another thing to notice is that when you're on a patch that's all the same color, everything goes to zero, since the sum of the values in our little grid is zero. This is very different from the previous two examples, where the sum of our little grid was one, which let us interpret it as a moving average, and hence a blur. All in all, this little process basically detects wherever there's variation in the pixel value as you move from left to right, and so it gives you a kind of way to pick up on all the vertical edges from your image. And similarly, if we rotated that grid around so that it varies as you move from the top to the bottom, this will be picking up on all the horizontal edges, which, in the case of our little pi creature image, does result in some pretty demonic eyes. This smaller grid, by the way, is often called a kernel, and the beauty here is how, just by choosing a different kernel, you can get different image processing effects, not just blurring or edge detection, but also things like sharpening. For those of you who have heard of a convolutional neural network, the idea there is to use data to figure out what the kernels should be in the first place, as determined by whatever the neural network wants to detect.
Another thing I should maybe bring up is the length of the output. For something like the moving average example, you might only want to think about the terms where both of the windows fully align with each other. Or, in the image processing example, maybe you want the final output to have the same size as the original. Now, convolutions, as a pure math operation, always produce an array that's bigger than the two arrays you started with, at least assuming one of them doesn't have a length of one. Just know that in certain computer science contexts, you often want to deliberately truncate that output. Another thing worth highlighting is that in the computer science context, this notion of flipping around that kernel before you let it march across the original often feels really weird and just uncalled for. But again, note that that's what's inherited from the pure math context, where, like we saw with the probabilities, it's an incredibly natural thing to do. And actually, I can show you one more pure math example where even the programmers should care about this one, because it opens the doors for a much faster algorithm to compute all of these. To set up what I mean by faster here, let me go back and pull up some Python again, and
I'm going to create two different, relatively big arrays. Each one will have 100,000 random elements in it, and I'm going to assess the runtime of the convolve function from the NumPy library. In this case, it runs it for multiple different iterations, tries to find an average, and it looks like, on this computer at least, it averages at 4.87 seconds. By contrast, if I use a different function from the SciPy library called fftconvolve, which is the same thing just implemented differently, that only takes 4.3 milliseconds on average. So, three orders of magnitude improvement. And again, even though it flies under a different name, it's giving the same output that the other convolve function does; it's just doing something clever to go about it. Remember how, with the probability example, I said another way you could think about the
convolution was to create this table of all the pairwise products, and then add up those pairwise products along the diagonals? There's, of course, nothing specific to probability here. Any time you're convolving two different lists of numbers, you can think about it this way: create this kind of multiplication table with all pairwise products, and then each sum along a diagonal corresponds to one of your final outputs. One context where this view is especially natural is when you multiply together two polynomials.
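The multiplication-table view can be sketched with NumPy's outer product, where anti-diagonal n collects exactly the pairs with i + j = n:

```python
import numpy as np

a, b = [1, 2, 3], [4, 5, 6]
table = np.outer(a, b)  # the grid of all pairwise products a[i] * b[j]

# Summing each anti-diagonal (entries where i + j is constant) recovers
# exactly what np.convolve(a, b) computes.
diagonals = [sum(table[i, n - i]
                 for i in range(len(a)) if 0 <= n - i < len(b))
             for n in range(len(a) + len(b) - 1)]
```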


For example, let me take the little grid we already have, and replace the top terms with 1, 2x, and 3x², and replace the other terms with 4, 5x, and 6x². Now, think about what it means when we're creating all of these different pairwise products between the two lists. What you're doing is essentially expanding out the full product of the two polynomials I have written down, and then when you add up along a diagonal, that corresponds to collecting all the like terms. Which is pretty neat: expanding a polynomial and collecting like terms is exactly the same process as a convolution.
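We can check this with NumPy's polynomial helpers, listing coefficients from the constant term up:

```python
import numpy as np
from numpy.polynomial import polynomial as P

p = [1, 2, 3]                  # 1 + 2x + 3x^2
q = [4, 5, 6]                  # 4 + 5x + 6x^2

expanded = P.polymul(p, q)     # expand the product and collect like terms
convolved = np.convolve(p, q)  # the convolution of the coefficient lists
```

Both give the coefficients of 4 + 13x + 28x² + 27x³ + 18x⁴, the same two operations in different clothing.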


But this allows us to do something that's pretty cool, because think about what we're saying here. We're saying that if you take two different functions and you multiply them together, which is a simple pointwise operation, that's the same thing as if you had first extracted the coefficients from each one of those, assuming they're polynomials, and then taken a convolution of those two lists of coefficients. What makes that so interesting is that convolutions feel, in principle, a lot more complicated than simple multiplication. And I don't just mean conceptually they're harder to think about; I mean, computationally, it requires more steps to perform a convolution than it does to perform a pointwise product of two different lists. For example, let's say I gave you two really big polynomials, say, each one with a hundred different coefficients. Then, if the way you multiplied them was to expand out this product, you know, filling in this entire 100×100 grid of pairwise products, that would require you to perform 10,000 different products. And when you're collecting all the like terms along the diagonals, that's another set of around 10,000 operations. More generally, in the lingo, we'd say the algorithm is O of n squared, meaning for two lists of size n, the way that the number of operations scales is in proportion to the square of n. On the other hand, if I think of two polynomials in terms of their outputs, for example, sampling their values at some handful of inputs, then multiplying them only requires as many operations as the number of samples, since, again, it's a pointwise operation. And with polynomials, you only need finitely many samples to be able to recover the coefficients. For example, two outputs are enough to uniquely specify a linear polynomial.
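Here's that sample-then-multiply idea sketched for our two quadratics; the sample points 0 through 4 are an arbitrary choice for illustration:

```python
import numpy as np

p = [1, 2, 3]                                  # 1 + 2x + 3x^2
q = [4, 5, 6]                                  # 4 + 5x + 6x^2

# The product has 5 coefficients, so 5 sample points pin it down.
xs = np.arange(5.0)
samples = (np.polyval(p[::-1], xs)             # np.polyval wants the highest
           * np.polyval(q[::-1], xs))          # degree first, hence [::-1]

# Solve the linear system V @ coeffs == samples, where row k of V is
# [1, x_k, x_k^2, x_k^3, x_k^4] -- a Vandermonde matrix.
V = np.vander(xs, 5, increasing=True)
coeffs = np.linalg.solve(V, samples)           # recovers the convolution
```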


Three outputs would be enough to uniquely specify a quadratic polynomial, and in general, if you know n distinct outputs, that's enough to uniquely specify a polynomial that has n different coefficients. Or, if you prefer, we could phrase this in the language of systems of equations. Imagine I tell you I have some polynomial, but I don't tell you what the coefficients are; those are a mystery to you. In our example, you might think of this as the product that we're trying to figure out. And then suppose I say, I'll just tell you what the outputs of this polynomial would be if you tried various different inputs, like 0, 1, 2, 3, on and on, and I give you enough so that you have as many equations as you have unknowns. It even happens to be a linear system of equations, so that's nice. And in principle, at least, this should be enough to recover the coefficients. So the rough algorithm outline, then, would be: whenever you want to convolve two lists of numbers, you treat them like they're coefficients of two polynomials, you sample those polynomials at enough outputs, multiply those samples pointwise, and then solve the system to recover the coefficients, as a sneaky backdoor way to find the convolution. And as I've stated it so far, at least, some of you could rightfully complain, Grant, that
is an idiotic plan. Because, for one thing, just calculating all these samples for one of the polynomials we know already takes on the order of n squared operations. Not to mention, solving that system is certainly going to be computationally at least as difficult as just doing the convolution in the first place. So, sure, we have this connection between multiplication and convolutions, but all of
the complexity happens in translating from one viewpoint to the other. But there is a trick. And those of you who know about Fourier transforms and the FFT algorithm might see where this is going. If you're unfamiliar with those topics, what I'm about to say might seem completely out of the blue; just know that there are certain paths you could have walked in math that make this more of an expected step. Basically, the idea is that we have a freedom of choice here.


If, instead of evaluating at some arbitrary set of inputs, like 0, 1, 2, 3, on and on, you choose to evaluate at a very specially selected set of complex numbers, specifically the ones that sit evenly spaced on the unit circle, what are known as the roots of unity, this gives us a friendlier system. The basic idea is that by finding a number where taking its powers falls into this cycling pattern, it means that the system we generate is going to have a lot of redundancy in the different terms that you're calculating, and by being clever about how you leverage that redundancy, you can save yourself a lot of work. This set of outputs that I've written has a special name: it's called the discrete Fourier transform of the coefficients. And if you want to learn more, I actually did another lecture for that same Julia MIT
class, all about discrete Fourier transforms. And there's also a really excellent video on the channel Reducible talking about the fast Fourier transform, which is an algorithm for computing these more quickly. Also, Veritasium recently did a really good video on FFTs, so you've got lots of options. And that fast algorithm really is the point for us. Again, because of all this redundancy, there exists a method to go from the coefficients to all of these outputs where, instead of doing on the order of n squared operations, you do on the order of n times the log of n operations, which is much, much better as you scale to big lists. And importantly, this FFT algorithm goes both ways: it also lets you go from the outputs back to the coefficients. So, bringing it all together, let's look back at our algorithm outline. Now we can say, whenever you're given two long lists of numbers and you want to take their convolution, first compute the fast Fourier transform of each one of them,
which, in the back of your mind, you can just think of as treating them like they're the coefficients of a polynomial and evaluating at a very specially selected set of points. Then, multiply together the two results that you just got, pointwise, which is nice and fast. And then do an inverse fast Fourier transform, and what that gives you is the sneaky backdoor way to compute the convolution that we were looking for. But this time, it only involves O of n log n operations.
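The whole pipeline fits in a few lines; here's a sketch using NumPy's real-input FFT (this is essentially what `scipy.signal.fftconvolve` packages up):

```python
import numpy as np

def fft_convolve(a, b):
    # Transform, multiply pointwise, transform back: O(n log n) overall,
    # versus O(n^2) for the direct sliding-window computation.
    n = len(a) + len(b) - 1          # length of the full convolution
    fa = np.fft.rfft(a, n)           # zero-pads each input out to length n
    fb = np.fft.rfft(b, n)
    return np.fft.irfft(fa * fb, n)
```

The zero-padding matters: without it, the pointwise product would compute a circular convolution, where the ends wrap around and mix together.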


That's really cool to me. This very specific context where convolutions show up, multiplying two polynomials, opens the doors for an algorithm that's relevant everywhere else that convolutions might come up, whether you want to add probability distributions, do some large image processing, whatever it might be. And I just think that's such a good example of why you should be excited when you see some operation or concept in math show up in a lot of seemingly unrelated areas. If you want a little homework, here's something that's fun to think about.


Explain why, when you multiply two different numbers, just ordinary multiplication the way we all learned in elementary school, what you're doing is basically a convolution between the digits of those numbers. There are some added steps with carries and the like, but the core step is a convolution. In light of the existence of a fast algorithm, what that means is that if you have two very large
integers, then there exists a way to find their product that's faster than the method we learned in elementary school, one that, instead of requiring O of n squared operations, only requires O of n log n, which doesn't even feel like it should be possible. The catch is that before this is actually useful in practice, your numbers would have to be absolutely monstrous. But still, it's cool that such an algorithm exists. And next up, we'll turn our attention to the continuous case, with a special focus on probability distributions.
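As a footnote on that homework, here's a sketch of multiplication as a digit convolution plus carrying; for truly enormous inputs, the `np.convolve` call is where an FFT-based convolution would slot in:

```python
import numpy as np

def multiply(x, y):
    # Digits least-significant first, so digit i carries place value 10^i.
    a = [int(d) for d in str(x)][::-1]
    b = [int(d) for d in str(y)][::-1]

    # The convolution gives each place value's total before carrying,
    # exactly the columns you sum in grade-school multiplication.
    column_sums = np.convolve(a, b)

    digits, carry = [], 0
    for value in column_sums:
        carry, digit = divmod(int(value) + carry, 10)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        digits.append(digit)
    return int("".join(map(str, digits[::-1])))
```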