Jedd's Computer (Science) Forays: 2016

A really wonderful aspect of learning about machine learning is that you can't help but learn about the field of statistics as well. As a computer scientist (or really, a software engineer -- I have a hard time calling myself a computer scientist), one of the joys of the field is that many interesting problems require expertise from other fields, and so computer scientists tend to be exposed to a great many ideas (even if only in a shallow sense)!

One of the major divides among statisticians is in the philosophy of probability. On one hand there are frequentists, who like to place probabilities only on data; there is some underlying process that the data describes, but one doesn't have knowledge of all data, so the only uncertainty is about what unknown data the process could generate. On the other hand there are Bayesians who, like the frequentists, place probabilities on data for the same reason, but who also place probabilities on the models they use to describe the underlying process and on the parameters of these models. They allow themselves to do this because they concede that, although there does exist one true model that could perfectly describe the underlying process, as the modeler they have their own uncertainty about what that model may be.
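To make the distinction a little more concrete, here is a minimal sketch using a coin flip. The data (7 heads in 10 flips) and the uniform Beta(1, 1) prior are assumptions chosen purely for illustration. The first estimate treats the coin's bias as a fixed unknown, while the second keeps a whole distribution over it.

```python
# Hypothetical data: 7 heads observed in 10 flips of a coin.
heads, flips = 7, 10

# Frequentist-style point estimate: the maximum-likelihood value of the bias.
mle = heads / flips

# Bayesian-style estimate: start from a uniform Beta(1, 1) prior over the bias
# (an assumption made for this sketch) and update it with the observed flips.
prior_a, prior_b = 1, 1
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior_mean = post_a / (post_a + post_b)

print(f"maximum-likelihood estimate: {mle:.3f}")
print(f"posterior is Beta({post_a}, {post_b}) with mean {posterior_mean:.3f}")
```

The posterior is itself a probability distribution over the parameter, which is exactly the kind of object the frequentist view declines to define.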

I think this is an interesting divide because it highlights two different sources of uncertainty. Some uncertainty exists because there exists true randomness in the world (or so physicists believe), while other uncertainty exists solely due to our own ignorance -- the quantity in question may be completely determined! We are just forced to work with incomplete information. I think it is fascinating that there exists a unified framework that allows us to manage both sources of uncertainty, but it does make me wonder: do these two sources of uncertainty warrant their own frameworks?

During my lazy weekend afternoons (and all the other days of the week) I've been going through Nando de Freitas' undergraduate machine learning course on YouTube. In lecture 26 he introduces gradient descent, an iterative algorithm for optimizing (potentially non-convex) functions. I'll refrain from explaining gradient descent more than that as there are many good explanations on the internet (including Professor de Freitas'), but I do want to discuss gradient descent's learning rate parameter and why, intuitively, Newton's method, also introduced in lecture 26, is able to side-step the learning rate.

As described in the video lecture, a common difficulty with gradient descent is picking the learning rate hyper-parameter. Pick too small a learning rate and gradient descent will slow to a crawl, making very little progress or even eventually stalling due to floating point underflow. Pick too large a learning rate and gradient descent will oscillate around the minimum/maximum, bouncing around and making progress only very slowly. In an ideal world, as the gradient vanishes, the gradient descent algorithm would be able to compensate by adjusting the learning rate so as to avoid both underflow and oscillation. This is exactly what Newton's method does.
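The effect of the learning rate is easy to see on a toy problem. The sketch below is my own illustration rather than anything from the lecture: it minimizes the assumed function $f(\theta) = \theta^2$ (gradient $2\theta$) from $\theta = 1$ with a few hand-picked learning rates.

```python
def gradient_descent(eta, theta=1.0, steps=50):
    """Run gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta

# Hand-picked learning rates: tiny, well chosen, nearly too large, too large.
for eta in (0.001, 0.45, 0.99, 1.01):
    print(f"eta={eta}: theta after 50 steps = {gradient_descent(eta):.6g}")
```

Each step multiplies $\theta$ by $(1 - 2\eta)$. A tiny rate therefore barely moves, a rate just under 1 flips sign on every step while shrinking slowly, and a rate above 1 grows without bound.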

Compare gradient descent
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - \eta_{k}\nabla f(\boldsymbol\theta_{k})$$
to Newton's method
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - H_{k}^{-1}\nabla f(\boldsymbol\theta_{k})$$

The only difference is that Newton's method has replaced the learning rate with the inverse of the Hessian matrix. The lecture derived Newton's method by showing that $f(\boldsymbol\theta)$ can be approximated by a second-order Taylor series expansion around $\boldsymbol\theta_{k}$. If you're familiar with Taylor series, then this explanation may be sufficient, but I prefer to think about it differently.
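For reference, the Taylor series argument runs roughly as follows (a sketch in the notation above, assuming $H_{k}$ is symmetric and invertible). Approximate $f$ near $\boldsymbol\theta_{k}$ by
$$f(\boldsymbol\theta) \approx f(\boldsymbol\theta_{k}) + \nabla f(\boldsymbol\theta_{k})^{T}(\boldsymbol\theta - \boldsymbol\theta_{k}) + \frac{1}{2}(\boldsymbol\theta - \boldsymbol\theta_{k})^{T} H_{k} (\boldsymbol\theta - \boldsymbol\theta_{k})$$
then set the gradient of the right-hand side with respect to $\boldsymbol\theta$ to zero:
$$\nabla f(\boldsymbol\theta_{k}) + H_{k}(\boldsymbol\theta - \boldsymbol\theta_{k}) = 0$$
Solving for $\boldsymbol\theta$ and calling the solution $\boldsymbol\theta_{k+1}$ gives
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - H_{k}^{-1}\nabla f(\boldsymbol\theta_{k})$$
which is the Newton update. The stationary point is a minimum of the approximation only when $H_{k}$ is positive definite.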

What is this Hessian matrix we've replaced the learning rate with? Just as the gradient vector is the multi-dimensional equivalent of the derivative, the Hessian matrix is the multi-dimensional equivalent of the second derivative. If $\boldsymbol\theta_{k}$ is our position on the surface of the function we're optimizing at time step $k$, then the gradient vector, $\nabla f(\boldsymbol\theta_{k})$, is the velocity at which we're descending towards the optimum, and the Hessian matrix, $H$, is our acceleration.
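Written out, the entries of the Hessian are the second partial derivatives of $f$:
$$H_{ij} = \frac{\partial^{2} f}{\partial\theta_{i}\,\partial\theta_{j}}$$
As a small example of my own choosing, take $f(\boldsymbol\theta) = \frac{1}{2}(\theta_{1}^{2} + 100\,\theta_{2}^{2})$. Then
$$\nabla f(\boldsymbol\theta) = \begin{pmatrix} \theta_{1} \\ 100\,\theta_{2} \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & 0 \\ 0 & 100 \end{pmatrix}$$
so the gradient changes a hundred times faster along $\theta_{2}$ than along $\theta_{1}$.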

Let's think back to the original problem of choosing the learning rate hyper-parameter. It is often too small, causing underflow, or too big, causing oscillation, and in an ideal world we could pick just the right learning rate at any time step, most likely in relation to the rate at which the gradient vector vanishes, to avoid both these problems. But wait, now we have this thing called the Hessian matrix that measures the rate at which the gradient vanishes!

When starting the algorithm's descent we will quickly gain acceleration as we start going downhill along the function's surface (i.e. the Hessian matrix will "increase"). We don't want to gain too much velocity, or else we'll overshoot the minimum and start oscillating, so we'd want to choose a smaller learning rate. As the algorithm nears the optimum, the geometry of the function will flatten out and we will eventually lose acceleration. We don't want to lose too much velocity, or else we'll underflow and stop making progress. It is precisely when we lose acceleration (i.e. the Hessian matrix "decreases") that we want to increase our learning rate.

So we have observed that there is an inverse relationship between the magnitude of the acceleration at which we descend towards the minimum and the magnitude of the most desirable learning rate. That is to say, at greater accelerations we want smaller learning rates and at smaller accelerations we want larger learning rates. It is precisely for this reason that Newton's method multiplies the gradient vector (the "velocity") by the inverse Hessian matrix (the "inverse acceleration")! At least, this is how I intuitively think about Newton's method. The derivation involving Taylor series expansions is probably the actual precise reason, but I like thinking about it in terms of velocities and accelerations.
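The inverse relationship can be checked on a small example. The sketch below assumes a hypothetical quadratic, $f(x, y) = \frac{1}{2}(x^2 + 100y^2)$, chosen only because its Hessian is constant and diagonal, so inverting it by hand is trivial. None of the function names come from the lecture.

```python
# Hypothetical test function: f(x, y) = 0.5 * (x**2 + 100 * y**2).
# Its gradient is (x, 100 * y) and its Hessian is the constant matrix
# [[1, 0], [0, 100]], so the curvature is 100 times larger along y than along x.

def gradient(theta):
    x, y = theta
    return (x, 100.0 * y)

def gradient_descent_step(theta, eta):
    gx, gy = gradient(theta)
    return (theta[0] - eta * gx, theta[1] - eta * gy)

def newton_step(theta):
    gx, gy = gradient(theta)
    # The Hessian is diagonal here, so applying its inverse just divides each
    # gradient component by the curvature in that direction.
    return (theta[0] - gx / 1.0, theta[1] - gy / 100.0)

start = (1.0, 1.0)

# One shared learning rate must stay small enough for the steep y direction
# (eta < 2 / 100 here), which makes progress along the flat x direction slow.
theta = start
for _ in range(100):
    theta = gradient_descent_step(theta, eta=0.01)
print("gradient descent after 100 steps:", theta)

# Newton's method rescales each direction by its own inverse curvature and,
# because this function is exactly quadratic, lands on the minimum in one step.
print("Newton's method after 1 step:", newton_step(start))
```

In the high-curvature direction the effective step is scaled down by 100, and in the low-curvature direction it is left at 1. On functions that are not quadratic, a single Newton step generally does not land on the minimum, and the Hessian has to be recomputed at each iterate.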

Finally, I'd like to end with one cautionary note. I'm an amateur mathematician. I like to think my intuition is correct, but it may not be. By sharing my intuition, I hope to help others who are looking to gain intuition, but if you need anything more than an intuitive understanding of Newton's method or gradient descent (because perhaps you're an actual mathematician in training), please consult a textbook or your favourite professor :-).

Monday, May 2, 2016

Frequentists vs Bayesians

Saturday, April 2, 2016

An intuition of Newton's method