Wednesday, December 9, 2015

On optimizing high-dimensional non-convex functions

Excuse me for the (in)completeness of this post. What follows is merely a thought, inspired by two independent statements, about a domain of science (or math, really) into which I am barely initiated. Let me give you these two statements first.

In the book Foundations of Data Science, the very first paragraph says
If one generates n points at random in the unit d-dimensional ball, for sufficiently large d, with high probability the distances between all pairs of points will be essentially the same. Also the volume of the unit ball in d-dimensions goes to zero as the dimension goes to infinity. 
That is statement one. Statement two is an assertion from this year's Deep Learning Tutorial at the Neural Information Processing Systems (NIPS) conference. The presenters claim that "most local minima are close to the bottom (global minimum error)."
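Statement one, at least, is easy to poke at numerically. Here is a quick sketch (mine, not from the book) that samples points uniformly from the unit d-dimensional ball with NumPy and checks how the pairwise distances behave as d grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def sample_unit_ball(n, d):
    """Draw n points uniformly at random from the unit d-dimensional ball."""
    # Direction: normalize Gaussian samples onto the unit sphere.
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # Radius: u**(1/d) makes the points uniform over the ball's volume.
    r = rng.uniform(size=(n, 1)) ** (1.0 / d)
    return x * r

for d in (2, 10, 100, 1000):
    dists = pdist(sample_unit_ball(200, d))  # all pairwise distances
    print(f"d={d:>4}: min={dists.min():.3f}  max={dists.max():.3f}  "
          f"max/min={dists.max() / dists.min():.2f}")
```

The max/min ratio drops toward 1 as d grows, which is exactly the "all pairwise distances are essentially the same" phenomenon the book describes.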

The second statement is significant for training (deep) neural networks. Training a neural network always uses a variant of gradient descent, a method that minimizes the network's error by repeatedly stepping in the direction of the negative gradient (i.e. the derivative) of the error function, looking for a point where that gradient is 0. Ideally gradient descent approaches the global minimum error (the point where the network best approximates the training data), but that is only guaranteed when the error function is convex (there are no hills when the function is plotted). Unfortunately, a neural network almost never has a convex error function. Hence, during gradient descent, one might get stuck in a local minimum that is nowhere near as low as the global minimum. The second statement, however, says we shouldn't be so concerned -- most local minima will be close to the global minimum!
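To make that concrete, here is a toy example (the one-dimensional "error" function and the learning rate are invented purely for illustration) where plain gradient descent lands in different minima depending on where it starts -- one local, one global:

```python
# Toy non-convex "error" function: global minimum near x = -1.7,
# shallower local minimum near x = +1.4 (values chosen for illustration).
def error(x):
    return 0.1 * x**4 - 0.5 * x**2 + 0.3 * x

def gradient(x):
    return 0.4 * x**3 - 1.0 * x + 0.3

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)  # step downhill along the negative gradient
    return x

for x0 in (-2.0, 2.0):
    x = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x = {x:+.3f}, error = {error(x):.3f}")
```

Starting at -2.0 reaches the global minimum; starting at +2.0 gets stuck in the shallower local minimum. The claim from the NIPS tutorial is that, for real high-dimensional networks, the gap between those two outcomes is usually small.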

But why?

My thought is that somehow the first statement can be used to prove the second. The training data of modern neural networks is high dimensional, and it will often be normalized as part of feature scaling. Given those two properties, the high-dimensional hills and valleys of a neural network's error function (roughly) form unit hemispheres. This (possibly) implies that local minima are close to the global minimum because the volume of every valley nears zero, making all the valleys increasingly similar to each other as the dimensionality of the training data increases.
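The "volume nears zero" part of that hunch is at least easy to check for the unit ball itself, using the standard formula V_d = pi^(d/2) / Gamma(d/2 + 1):

```python
import numpy as np
from scipy.special import gammaln

def unit_ball_volume(d):
    """Volume of the unit d-dimensional ball: pi**(d/2) / Gamma(d/2 + 1)."""
    # Work in log space to avoid overflow in the Gamma function.
    return np.exp((d / 2) * np.log(np.pi) - gammaln(d / 2 + 1))

for d in (2, 5, 10, 20, 50, 100):
    print(f"d={d:>3}: volume = {unit_ball_volume(d):.3e}")
```

The volume peaks around d = 5 and then collapses; by d = 100 it is on the order of 10^-40. Whether the valleys of a real error surface behave anything like unit balls is, of course, the part I'm hand-waving.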

I want to stress, though, that this is a half-baked idea. It's probably rubbish. It might also be redundant! Perhaps the second statement already has a proof and I've missed it. Either way, I would love to hear whether others find my intuition plausible or, even better, to be pointed toward a proof of the second statement.