During my lazy weekend afternoons (and all the other days of the week) I've been going through Nando de Freitas' undergraduate machine learning course on YouTube. In lecture 26 he introduces gradient descent, an iterative algorithm for optimizing (potentially non-convex) functions. I'll refrain from explaining gradient descent beyond that, as there are many good explanations on the internet (including Professor de Freitas'), but I do want to discuss gradient descent's learning rate parameter and why, intuitively, Newton's method, also introduced in lecture 26, is able to side-step the learning rate.
As described in the video lecture, a common difficulty with gradient descent is picking the learning rate hyper-parameter. Pick too small a learning rate and gradient descent will slow to a crawl, making very little progress or even eventually stopping altogether due to floating point underflow. Pick too large a learning rate and gradient descent will oscillate around the minimum/maximum, bouncing around and making progress very slowly. In an ideal world, as the gradient vanishes, the gradient descent algorithm would compensate by adjusting the learning rate so as to avoid both underflow and oscillation. This is exactly what Newton's method does.
Compare gradient descent
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - \eta_{k}\nabla f(\boldsymbol\theta_{k})$$
to Newton's method
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - H_{k}^{-1}\nabla f(\boldsymbol\theta_{k})$$
The only difference is that Newton's method has replaced the learning rate with the inverse of the Hessian matrix. The lecture derived Newton's method by showing that \(f(\boldsymbol\theta)\) can be approximated by a second-order Taylor series expansion around \(\boldsymbol\theta_{k}\). If you're familiar with Taylor series, then this explanation may be sufficient, but I prefer to think about it differently.
What is this Hessian matrix we've replaced the learning rate with? Just as the gradient vector is the multi-dimensional equivalent of the derivative, the Hessian matrix is the multi-dimensional equivalent of the second derivative. If \(\boldsymbol\theta_{k}\) is our position on the surface of the function we're optimizing at time step \(k\), then the gradient vector, \(\nabla f(\boldsymbol\theta_{k})\), is our velocity at which we're descending towards the optimum, and the Hessian matrix, \(H\), is our acceleration.
Let's think back to the original problem of choosing the learning rate hyper-parameter. It is often too small, causing underflow, or too big, causing oscillation, and in an ideal world we could pick just the right learning rate at any time step, most likely in relation to the rate at which the gradient vector vanishes, to avoid both these problems. But wait, now we have this thing called the Hessian matrix that measures the rate at which the gradient vanishes!
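In one dimension this connection is easy to see. The Hessian reduces to the second derivative, and Newton's update can be read as gradient descent with an adaptive learning rate:

$$\theta_{k+1} = \theta_{k} - \frac{f'(\theta_{k})}{f''(\theta_{k})} = \theta_{k} - \eta_{k}\,f'(\theta_{k}), \quad \text{where } \eta_{k} = \frac{1}{f''(\theta_{k})}$$

That is, the effective learning rate grows exactly as the curvature shrinks.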
When starting the algorithm's descent we will quickly gain acceleration as we start going downhill along the function's surface (i.e. the Hessian matrix will "increase"). We don't want to gain too much velocity else we'll overshoot the minimum and start oscillating, so we'd want to choose a smaller learning rate. As the algorithm nears the optimum, the geometry of the function will flatten out and we will eventually lose acceleration. We don't want to lose too much velocity else we'll underflow and stop making progress. It is precisely when we lose acceleration (i.e. the Hessian matrix "decreases") that we want to increase our learning parameter.
So we have observed that there is an inverse relationship between the magnitude of the acceleration at which we descend towards the minimum and the magnitude of the most desirable learning parameter. That is to say, at greater accelerations we want smaller learning rates, and at smaller accelerations we want larger learning rates. It is precisely for this reason that Newton's method multiplies the gradient vector (the "velocity") by the inverse Hessian matrix (the "inverse acceleration")! At least, this is how I intuitively think about Newton's method. The derivation involving Taylor series expansions is probably the actual precise reason, but I like thinking about it in terms of velocities and accelerations.
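As a toy illustration of this behaviour, here is a small sketch (my own, not from the lecture — the quadratic \(f(x) = (x-2)^2\) and all names are illustrative) comparing fixed-learning-rate gradient descent with the one-dimensional Newton update:

```scala
object NewtonVsGradientDescent {
  // Toy objective: f(x) = (x - 2)^2, minimized at x = 2.
  def grad(x: Double): Double = 2.0 * (x - 2.0) // f'(x)
  def hess(x: Double): Double = 2.0             // f''(x), constant for a quadratic

  // Gradient descent with a fixed learning rate eta.
  def gradientDescent(x0: Double, eta: Double, steps: Int): Double =
    (1 to steps).foldLeft(x0)((x, _) => x - eta * grad(x))

  // Newton's method: the inverse second derivative plays the role of eta.
  def newton(x0: Double, steps: Int): Double =
    (1 to steps).foldLeft(x0)((x, _) => x - grad(x) / hess(x))
}
```

On a quadratic the second-order information is exact, so Newton's method lands on the minimum in a single step, while fixed-rate gradient descent only approaches it geometrically.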
Finally, I'd like to end with one cautionary note. I'm an amateur mathematician. I like to think my intuition is correct, but it may not be. By sharing my intuition, I hope to help others who are looking to gain intuition, but if you need anything more than an intuitive understanding of Newton's method or gradient descent (because perhaps you're an actual mathematician in training), please consult a textbook or your favourite professor :-).
Saturday, April 2, 2016
Saturday, October 3, 2015
TIL: premultiplied-alpha colors
Alpha is the measure of how translucent an object is. An alpha of 0.0 means the object is entirely transparent, an alpha of 1.0 means the object is entirely opaque, and an alpha in the middle means a fraction of the total light may pass through the object. Traditionally, a color is represented by 4 constituent components: a red contribution, a green contribution, a blue contribution, and the alpha. When compositing two colors together, one on top of the other, the alpha acts as a modulus of the colors, indicating how much of the top color and how much of the bottom color contribute to the new composited color. The traditional compositing operation is as follows, where A is being composited over top B:
$$C_{rgb} = A_{a} \cdot A_{rgb} + (1 - A_{a}) \cdot B_{a} \cdot B_{rgb}$$
Alternatively, we may wish to premultiply the red, green, and blue components by the alpha:

$$A'_{rgb} = A_{a} \cdot A_{rgb}, \qquad A'_{a} = A_{a}$$
With this representation we get a new compositing equation:
$$C = A + (1 - A_{a}) \cdot B$$
This new form is interesting for a couple reasons.
- It is computationally more efficient. It requires one less vector multiplication.
- It is a closed form. Compositing a premultiplied-alpha color over top a premultiplied-alpha color yields another premultiplied-alpha color. The same cannot be said of non-premultiplied-alpha colors. Compositing two non-premultiplied-alpha colors yields, interestingly, a premultiplied-alpha color.
- When filtering (aka downsampling), it produces more visually accurate results. A picture is worth a thousand words.
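The two representations can be sketched in a few lines of Scala (the `Color` case class and names here are my own illustration, not a real graphics API):

```scala
object PremultipliedAlpha {
  case class Color(r: Double, g: Double, b: Double, a: Double)

  // Premultiply the red, green, and blue components by the alpha.
  def premultiply(c: Color): Color =
    Color(c.r * c.a, c.g * c.a, c.b * c.a, c.a)

  // Composite premultiplied a over premultiplied b: one vector
  // multiplication and one addition, and the result is itself a
  // premultiplied-alpha color (the operation is closed).
  def over(a: Color, b: Color): Color =
    Color(
      a.r + (1 - a.a) * b.r,
      a.g + (1 - a.a) * b.g,
      a.b + (1 - a.a) * b.b,
      a.a + (1 - a.a) * b.a)
}
```

Note that `over` never needs to multiply by `b.a` — that multiplication was already baked in by `premultiply`, which is where the saved vector multiplication comes from.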
And that's it. Premultiplied-alpha colors are nifty.
Friday, October 2, 2015
TIL: The column space and null space
A vector space is a set of vectors that is closed under addition and scalar multiplication. For example, given two vectors v and w, the set of all linear combinations cv + dw, for arbitrary coefficients c and d, forms a vector space.
Columns spaces and null spaces are special categories of vector spaces that have interesting properties related to systems of linear equations, Ax = b. The column space of the matrix A, C(A), is simply the linear combinations of A's column vectors. This implies that Ax = b may only be solved when b is a vector in A's column space. Finally, the null space of matrix A is another vector space formed by all the solutions to Ax = 0.
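A tiny numeric sketch of both facts (the matrix and helper below are my own illustrative choices): A's second row is twice its first, so its columns are linearly dependent; b = (3, 6) is three times the first column and so lies in C(A), while x = (2, -1) solves Ax = 0 and so lies in the null space.

```scala
object SpacesExample {
  // A singular 2x2 matrix: the second row is twice the first, so the
  // null space contains more than just the zero vector.
  val A: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(2.0, 4.0))

  // Compute the matrix-vector product Ax.
  def matVec(m: Array[Array[Double]], x: Array[Double]): Seq[Double] =
    m.toSeq.map(row => row.zip(x).map { case (a, b) => a * b }.sum)
}
```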
Thursday, October 1, 2015
TIL: a principled approach to dynamic programming
Dynamic programming has always been a topic I understood at a surface level (it's just memoization, right?!), but ultimately feared for lack of real-world experience solving such problems. I read today a superb explanation of a principled approach to solving dynamic programming problems. Here it is:
- Try to define the problem at hand in terms of composable sub-problems. In other words, ask yourself: what information would make solving this problem easier?
- Define a more rigorous recurrence relation between the sub-problems and the current problem. What are the base-cases and how do the sub-problems answer the current problem?
- Build a solution lookup table (up to the point of the current problem) by first initializing it for the base cases and then filling in the sub-problems in a bottom-up approach. The bottom-up approach is the defining characteristic of a dynamic programming solution; alone, the lookup table is just memoization. Note: building the solution table bottom-up will often look like a post-order depth-first search.
Here is an implementation (in Scala) of the following problem:
Given a general monetary system with M different coins of value {c1, c2, . . . , cM}, devise an algorithm that can make change for amount N using a minimum number of coins. What is the complexity of your algorithm?
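A bottom-up Scala solution along the lines described above might look like this (the object and method names are my own):

```scala
object CoinChange {
  // table(n) holds the minimum number of coins needed to make amount n,
  // built bottom-up from the base case table(0) = 0 using the recurrence
  //   table(n) = 1 + min over coins c of table(n - c).
  // Runs in O(N * M) time for amount N and M coin denominations.
  def minCoins(coins: Seq[Int], amount: Int): Int = {
    val NoSolution = Int.MaxValue
    val table = Array.fill(amount + 1)(NoSolution)
    table(0) = 0 // base case: zero coins make change for zero
    for {
      n <- 1 to amount
      c <- coins
      if c <= n && table(n - c) != NoSolution
    } table(n) = math.min(table(n), table(n - c) + 1)
    if (table(amount) == NoSolution) -1 else table(amount)
  }
}
```

The complexity is O(NM): each of the N table entries considers each of the M coins once.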
TIL: Alpha-to-coverage for order-independent transparency
Alpha-to-coverage is a computer graphics technique, supported by most (all?) modern GPUs, for rendering translucent multi-sampled anti-aliased primitives. Given an alpha value in the range [0.0, 1.0] and N samples of color stored per pixel, the alpha channel will be discretized into a coverage-mask over the N samples. An alpha of 0 will generate an empty coverage-mask (no samples take on the new color), an alpha of 1 will generate a full coverage-mask (all samples take on the new color), and values in between will generate a partial coverage-mask (between 1 and N-1 samples take on the new color).
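A simplistic sketch of that discretization (my own illustration, not how any particular GPU implements it — real hardware typically also dithers the mask spatially):

```scala
object AlphaToCoverage {
  // Discretize alpha in [0.0, 1.0] into a bitmask over nSamples
  // per-pixel samples: roughly round(alpha * nSamples) samples are
  // marked as covered.
  def coverageMask(alpha: Double, nSamples: Int): Int = {
    val covered = math.round(alpha * nSamples).toInt
    (1 << covered) - 1 // the low `covered` bits are set
  }
}
```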
It's not alpha-to-coverage itself that I learned today, however, but rather its implication: alpha-to-coverage is a form of order-independent transparency! In the naïve case, one benefits from N layers of transparency. I suppose this is the whole point of alpha-to-coverage, but I never put two and two together*.
* I blame my experience. Being a GPU driver engineer, but never a rendering engineer, I'm exposed to a large cross-section of rendering techniques and low-level graphics implementation details, but to the bigger picture of real-time rendering only by word of mouth and self-study.
TIL: Today I learned
It has been a long time since my last appearance. I've written a couple of unfinished posts in that time, but I've realized that writing a detailed essay that's up to a satisfactory standard of quality is actually quite time consuming (who would have thought!). In an attempt at more productive writing, I will try to publish one post per day expounding on something new I've learned that day. Nothing more than a paragraph or two. My hope is that this will be burden-free enough not to require extraneous effort, yet valuable enough that it's worth recording. Equally, this will serve as a nice personal journal of my educational journey post-schooling.