## Wednesday, December 9, 2015

### On optimizing high-dimensional non-convex functions

Excuse me for the (in)completeness of this post. What follows is merely a thought, inspired by two independent statements, about a domain of science (or math, really) into which I am barely initiated. Let me give you these two statements first.

In the book Foundations of Data Science, the very first paragraph says, "If one generates n points at random in the unit d-dimensional ball, for sufficiently large d, with high probability the distances between all pairs of points will be essentially the same. Also the volume of the unit ball in d-dimensions goes to zero as the dimension goes to infinity."

That is statement one. Statement two is an assertion given in this year's Deep Learning Tutorial at the Neural Information Processing Systems (NIPS) conference. They claim that "most local minima are close to the bottom (global minimum error)."
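Statement one is easy to poke at numerically. The following is a minimal sketch, not anything taken from the book: the helper names (`random_point_in_ball`, `relative_spread`, `unit_ball_volume`), the sample size, and the seed are my own assumptions. It samples points uniformly in the unit d-ball, measures how much the pairwise distances vary relative to their mean, and evaluates the standard closed form for the ball's volume, $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$.

```python
import math
import random
import statistics


def random_point_in_ball(d, rng):
    # A normalized Gaussian vector gives a uniform direction;
    # scaling by U**(1/d) makes the point uniform inside the ball.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    radius = rng.random() ** (1.0 / d)
    return [radius * x / norm for x in g]


def relative_spread(d, n=40, seed=0):
    # Standard deviation of all pairwise distances divided by their mean.
    rng = random.Random(seed)
    points = [random_point_in_ball(d, rng) for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return statistics.pstdev(dists) / statistics.mean(dists)


def unit_ball_volume(d):
    # Closed form for the volume of the unit ball in d dimensions.
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


if __name__ == "__main__":
    for d in (2, 10, 100, 500):
        print(d, relative_spread(d), unit_ball_volume(min(d, 100)))
```

If the statement holds, the relative spread should shrink as d grows, and the volume should peak at a small d and then collapse toward zero.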

The second statement is significant for training (deep) neural networks. Training a neural network almost always uses a variant of gradient descent, a method for minimizing the network's error by repeatedly stepping against the gradient (i.e. the derivative) of the error function until that gradient is approximately 0. Ideally gradient descent will approach the global minimum error (the point where the neural network gives the most desirable approximations of the training data), but that is only guaranteed when the error function is convex (when plotted, it has a single valley and no hills to get stuck behind). Unfortunately, a neural network almost never has a convex error function. Hence, during gradient descent, one might find a local minimum that is nowhere near as low as the global minimum. The second statement, however, says we shouldn't be so concerned: most local minima will be close to the global minimum!
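To make the worry concrete, here is a toy sketch of plain gradient descent on a one-dimensional non-convex function. The function, learning rate, and step count are arbitrary choices of mine for illustration, and nothing here resembles a real neural network's error surface. Depending on the starting point, descent settles into a different valley, and one valley is clearly shallower than the other.

```python
def error(x):
    # A non-convex "error function" with two valleys of different depth.
    return x**4 - 3 * x**2 + x


def gradient(x):
    # Derivative of error(x).
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=2000):
    # Repeatedly step against the gradient from the given starting point.
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


left = gradient_descent(-2.0)   # ends in the deeper valley (the global minimum)
right = gradient_descent(2.0)   # ends in the shallower valley (a local minimum)

if __name__ == "__main__":
    print(left, error(left))
    print(right, error(right))
```

In one dimension the two minima differ a lot. The second statement claims that in very high dimensions this gap tends to be small for most local minima.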

But why?

My thought is that somehow the first statement can be used to prove the second statement. The training data of modern neural networks is high dimensional. It will also often be normalized as part of feature scaling. Given those two properties, the high-dimensional hills and valleys of a neural network's error function then (roughly) form unit hemispheres. This (possibly) implies that local minima are close to the global minimum because the volume of every valley nears zero, making all the valleys increasingly similar to each other as the dimensionality of the training data increases.

I want to stress though, this is a half-baked idea. It's probably rubbish. It might also be redundant! Perhaps the second statement already has a proof and I've missed it. Either way, I would love to see if others find my intuition plausible or, even better, to be pointed in the direction of a proof for the second statement.

## Saturday, October 3, 2015

### TIL: premultiplied-alpha colors

Alpha is the measure of how opaque an object is. An alpha of 0.0 means the object is entirely transparent, an alpha of 1.0 means the object is entirely opaque, and an alpha in the middle means a fraction of the total light may pass through the object. Traditionally, a color is represented by 4 constituent components: a red contribution, a green contribution, a blue contribution, and the alpha. When compositing two colors together, one on top of the other, the alpha acts as a weight on the colors, indicating how much of the top color and how much of the bottom color contribute to the new composited color. The traditional compositing operation is as follows, where A is being composited over B:

$A = \begin{bmatrix} r_{a} & g_{a} & b_{a} & a_{a} \end{bmatrix}$

$B = \begin{bmatrix} r_{b} & g_{b} & b_{b} & a_{b} \end{bmatrix}$

$C_{rgb} = a_{a}\cdot A_{rgb} + (1 - a_{a})\cdot B_{rgb}$

Alternatively, we may wish to premultiply the red, green, and blue components by the alpha:

$C = \begin{bmatrix} a\cdot r & a\cdot g & a\cdot b & a \end{bmatrix}$

With this representation we get a new compositing equation:

$C = A + (1 - a_{a})\cdot B$

This new form is interesting for a few reasons.

1. It is computationally more efficient. It requires one fewer vector multiplication.
2. It is closed under compositing. Compositing a premultiplied-alpha color over top a premultiplied-alpha color yields another premultiplied-alpha color. The same cannot be said of non-premultiplied-alpha colors. Compositing two non-premultiplied-alpha colors yields, interestingly, a premultiplied-alpha color.
3. When filtering (aka downsampling), it produces more visually accurate results. A picture is worth a thousand words.

And that's it. Premultiplied-alpha colors are nifty.
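For the curious, here is a small sketch of the premultiplied "over" operation and of the filtering point. Colors are plain `(r, g, b, a)` tuples and all function names are my own, chosen for illustration. The filtering example averages an opaque red texel with a fully transparent green one, which is roughly what a 2-to-1 downsample does.

```python
def premultiply(color):
    # Scale the color channels by alpha; alpha itself is unchanged.
    r, g, b, a = color
    return (r * a, g * a, b * a, a)


def unpremultiply(color):
    # Undo premultiplication; a fully transparent color has no recoverable hue.
    r, g, b, a = color
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)
    return (r / a, g / a, b / a, a)


def over_premultiplied(top, bottom):
    # C = A + (1 - a_a) * B, applied to all four channels.
    k = 1.0 - top[3]
    return tuple(t + k * b for t, b in zip(top, bottom))


def average(c1, c2):
    # A 2-to-1 box filter: the channel-wise mean of two texels.
    return tuple((x + y) / 2.0 for x, y in zip(c1, c2))


red = (1.0, 0.0, 0.0, 1.0)          # opaque red
clear_green = (0.0, 1.0, 0.0, 0.0)  # invisible, but its rgb says "green"

# Filtering straight-alpha colors lets the invisible green bleed in.
naive = average(red, clear_green)
# Filtering premultiplied colors keeps the result pure red at half coverage.
filtered = unpremultiply(average(premultiply(red), premultiply(clear_green)))

# Half-transparent red over opaque blue, both premultiplied.
composited = over_premultiplied(premultiply((1.0, 0.0, 0.0, 0.5)),
                                premultiply((0.0, 0.0, 1.0, 1.0)))
```

Here `naive` picks up a green tint from a texel nobody can see, while `filtered` stays red. The composited result also comes out as a valid premultiplied color with an alpha of 1.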

## Friday, October 2, 2015

### TIL: The column space and null space

A vector space is a set of vectors that is closed under addition and scalar multiplication. In other words, for any two vectors v and w in the space, every linear combination cv + dw, for arbitrary coefficients c and d, is also in the space.

Column spaces and null spaces are special categories of vector spaces that have interesting properties related to systems of linear equations, Ax = b. The column space of the matrix A, C(A), is simply the set of all linear combinations of A's column vectors. This implies that Ax = b can be solved only when b is a vector in A's column space. Finally, the null space of matrix A is another vector space formed by all the solutions x to Ax = 0.
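A tiny worked example may help. The sketch below uses only the standard library, and the helper names (`rank`, `in_column_space`, `mat_vec`) are hypothetical. It relies on one standard fact: b lies in C(A) exactly when appending b as an extra column does not increase the rank of A.

```python
from fractions import Fraction


def rank(matrix):
    # Rank via Gaussian elimination with exact rational arithmetic.
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def in_column_space(a, b):
    # b is in C(A) exactly when [A | b] has the same rank as A.
    augmented = [row + [bi] for row, bi in zip(a, b)]
    return rank(augmented) == rank(a)


def mat_vec(a, x):
    # Plain matrix-vector product Ax.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


# The second column is twice the first, so C(A) is a line in 3-D space.
A = [[1, 2],
     [2, 4],
     [3, 6]]

solvable = in_column_space(A, [1, 2, 3])    # b is 1 * (first column)
unsolvable = in_column_space(A, [1, 0, 0])  # b is off that line
null_vector = mat_vec(A, [2, -1])           # 2 * col1 - col2 = 0
```

Here x = (2, -1) is in the null space of A, and so is every scalar multiple of it.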

## Thursday, October 1, 2015

### TIL: a principled approach to dynamic programming

Dynamic programming has always been a topic I understood at a surface level (it's just memoization, right?!), but ultimately feared for lack of real-world experience solving such problems. I read today a superb explanation of a principled approach to solving dynamic programming problems. Here it is:

1. Try to define the problem at hand in terms of composable sub-problems. In other words, ask yourself what information would make solving this problem easier?
2. Define a more rigorous recurrence relation between the sub-problems and the current problem. What are the base-cases and how do the sub-problems answer the current problem?
3. Build a solution lookup table (up to the point of the current problem) by first initializing it for the base cases and then for the sub-problems in a bottom-up approach. The bottom-up approach is the defining characteristic of a dynamic programming solution. Alone, the lookup table is just memoization. Note: building the solution table bottom up will often look like a post-order depth-first search.

Here is an implementation (in Scala) of a solution to the following problem:

Given a general monetary system with M different coins of value {c1, c2, . . . , cM}, devise an algorithm that can make change for amount N using a minimum number of coins. What is the complexity of your algorithm?
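As a rough sketch of the three steps applied to this problem (in Python rather than Scala, and not the original listing; the name `min_coins` is hypothetical): the sub-problem is "the fewest coins needed to make amount v" for every v up to N, the recurrence is `table[v] = 1 + min(table[v - c])` over all coins c no larger than v, and the base case is `table[0] = 0`.

```python
def min_coins(coins, amount):
    # table[v] holds the fewest coins needed to make value v (inf if impossible).
    unreachable = float("inf")
    table = [0] + [unreachable] * amount  # base case: zero coins make amount 0

    # Fill the table bottom-up so every sub-problem is solved before it is used.
    for value in range(1, amount + 1):
        for coin in coins:
            if coin <= value and table[value - coin] + 1 < table[value]:
                table[value] = table[value - coin] + 1

    return None if table[amount] == unreachable else table[amount]


if __name__ == "__main__":
    print(min_coins([1, 5, 10, 25], 63))
    print(min_coins([1, 3, 4], 6))
```

This version runs in O(N · M) time and O(N) space. The `[1, 3, 4]` case is worth a look: a greedy strategy would pick 4 + 1 + 1, while the table finds 3 + 3.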

### TIL: Alpha-to-coverage for order-independent transparency

Alpha-to-coverage is a computer graphics technique, supported by most (all?) modern GPUs, for rendering translucent multi-sampled anti-aliased primitives. Given an alpha value in the range of [0.0, 1.0] and N samples of color stored per pixel, the alpha value is discretized into a coverage-mask of the N samples. An alpha of 0 will generate an empty coverage-mask (no samples take on the new color), an alpha of 1 will generate a full coverage-mask (all samples take on the new color), and values in between will generate a partial coverage-mask (between 1 and N-1 samples take on the new color).
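A simplified model of that discretization, as a sketch: the function name `coverage_mask` and the "set the lowest k bits" pattern are assumptions of mine. Real hardware picks which samples to cover in an implementation-defined way (often dithered), so only the count of covered samples is meant to be representative here.

```python
def coverage_mask(alpha, n_samples):
    # Clamp alpha, then round to the nearest whole number of covered samples.
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * n_samples + 0.5)
    # Represent the mask as an integer with the lowest `covered` bits set.
    return (1 << covered) - 1


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, format(coverage_mask(alpha, 4), "04b"))
```

With 4 samples there are only five distinct coverage levels, which is why alpha-to-coverage tends to show banding unless the mask pattern is varied from pixel to pixel.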

It's not alpha-to-coverage itself that I learned today, however, but rather its implication: alpha-to-coverage is a form of order-independent transparency! In the naïve case, one benefits from N layers of transparency. I suppose this is the whole point of alpha-to-coverage, but I never put two and two together*.

* I blame my experience. Being a GPU driver engineer, but never a rendering engineer, I'm exposed to a large cross-section of rendering techniques and low-level graphics implementation details, but never to the greater picture of real-time rendering, except by word of mouth and self-study.
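To convince myself of the order-independence claim, here is a toy model of a single multi-sampled pixel. It assumes a per-sample depth test with depth writes enabled. The fragment layout `(depth, color, mask)`, the `render` function, and the specific masks are all hypothetical. The two translucent fragments are drawn in both orders and the resolved pixel comes out identical, because each sample simply keeps the nearest fragment that covers it.

```python
def render(fragments, n_samples, background):
    # Each sample stores its own depth and color, as in MSAA.
    depth = [float("inf")] * n_samples
    color = [background] * n_samples
    for frag_depth, frag_color, mask in fragments:
        for s in range(n_samples):
            covered = (mask >> s) & 1
            # Per-sample depth test: the nearest covering fragment wins.
            if covered and frag_depth < depth[s]:
                depth[s] = frag_depth
                color[s] = frag_color
    # Resolve: average the samples channel by channel.
    return tuple(sum(c[i] for c in color) / n_samples for i in range(3))


black = (0.0, 0.0, 0.0)
near_red = (0.3, (1.0, 0.0, 0.0), 0b0011)  # alpha 0.5 -> 2 of 4 samples
far_blue = (0.7, (0.0, 0.0, 1.0), 0b0110)  # alpha 0.5 -> 2 of 4 samples

front_to_back = render([near_red, far_blue], 4, black)
back_to_front = render([far_blue, near_red], 4, black)
```

Traditional alpha blending, by contrast, would need the two fragments sorted back to front to give a consistent answer.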

### TIL: Today I learned

It has been a long time since my last appearance. I've written a couple of unfinished posts in that time, but I've realized that writing a detailed essay that's up to a satisfactory standard of quality is actually quite time consuming (who would have thought!). In an attempt at more productive writing, I will try to publish one post per day expounding on something new I've learned that day. Nothing more than a paragraph or two. My hope is that this will be burden-free enough not to require extraneous effort, yet valuable enough that it's worth recording. Equally, this will serve as a nice personal journal of my educational journey post-schooling.