Jedd's Computer (Science) Forays<br />
<br />
<hr />
<b>Learning vs Inference</b> (2017-09-14)<br />
<br />
In machine learning, the terms "learning" and "inference" are used often and it's not always clear what is meant. For example, are "variational inference" and neural network "inferencing" the same thing? Usually not!<br />
<br />
When the deep learning crowd says "inference", what they mean is "perform a prediction." When the bayesian crowd says "inference", what they really mean is "compute the posterior distribution of a latent variable model"; e.g., for a model p(x, z) = p(x | z)p(z) and data x, compute p(z | x).<br />
<br />
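To make the contrast concrete, here is a tiny sketch of mine (the one-layer "network" and the Beta-Bernoulli coin model are my own toy choices):

```python
# Deep-learning "inference": run a trained model forward to make a prediction.
# Here a one-layer "network" whose weights are a single point estimate.
def nn_inference(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# Bayesian "inference": compute the posterior p(z | x) of a latent variable.
# Model: z ~ Beta(1, 1) is a coin's bias, each flip x_i | z ~ Bernoulli(z).
# Conjugacy makes the posterior exact: Beta(1 + heads, 1 + tails).
def bayesian_inference(flips):
    heads = sum(flips)
    tails = len(flips) - heads
    return (1 + heads, 1 + tails)  # parameters of the full posterior distribution

print(nn_inference([2.0, -1.0], [1.0, 2.0]))  # a single predicted number
print(bayesian_inference([1, 1, 1, 0]))       # a whole distribution over z
```

The first returns one number; the second returns the parameters of an entire distribution over the unknown.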
When someone says learning, they mean their model has a set of parameters and they need to find a good setting of those parameters; e.g. their model is actually p(x, z) = p(x | z; theta)p(z) and they want to find theta. For frequentists, this means finding a point estimate of the parameters. For a bayesian, learning is inference because the parameters are treated as latent, random variables, so they can use normal variational inference or MCMC to compute a distribution over the parameters. Note that for 99% of deep learning practitioners, training is synonymous with learning, as the practitioner will likely only be computing a single set of values (aka a point estimate) for their weights.<br />
<br />
<hr />
<b>Why is the ELBO difficult to optimize?</b> (2017-09-10)<br />
The task of bayesian inference is to compute the posterior p(z | x) of the model p(x, z) = p(x|z)p(z) where z is a latent variable. This is often intractable, so a bayesian may resort to approximating it with some easier distribution q(z); this method is called variational inference. Usually the approximating distribution is chosen from a parameterized family of distributions. Thus q(z) is really the variational distribution q(z; lambda) for some chosen value lambda. How is this parameter lambda chosen?<br />
<br />
A common method is to minimize the KL-divergence between the variational distribution and the true posterior, KL[q(z; lambda) || p(z | x)]. Of course, the posterior isn’t actually known, but if we break down this KL divergence, we see the following:<br />
<br />
KL[q(z) || p(z | x)] = E_q(z)[log(q(z)) - log(p(z|x))]<br />
= E_q(z)[log(q(z)) - log(p(x, z)/p(x))]<br />
= E_q(z)[log(q(z)) - log(p(x, z)) + log(p(x))]<br />
= E_q(z)[log(q(z)) - log(p(x, z))] + log(p(x))<br />
= E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))] + log(p(x))<br />
<br />
The KL divergence can be minimized up to a constant (the marginal likelihood) wrt the variational parameters since we know how to compute p(x, z) per the model’s definition. This term that we want to minimize — E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))] — is the negative evidence lower bound (ELBO). So, back to the original title question — why is the ELBO difficult (but not impossible) to optimize?<br />
<br />
It actually isn’t difficult to optimize for certain classes of models. If all the latent variables are chosen to be independent in the variational distribution (i.e. the mean-field assumption) and the model is conditionally conjugate, then one can use coordinate ascent variational inference. This method is fast to compute. Unfortunately, the restrictions it places on the model are also quite limiting. It is when the model is unconstrained that optimizing the ELBO becomes difficult. But again, why?<br />
<br />
If nothing is known about the model p(x|z)p(z) except a method to evaluate it, we are relegated to optimizing the ELBO by gradient descent. This entails computing the gradient of the ELBO (excuse my abuse of notation):<br />
<br />
Gradient of E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))]<br />
= gradient of (integral of q(z)*(log(q(z)) - log(p(x|z)) - log(p(z)))dz)<br />
= gradient of (integral of q(z)*log(q(z))dz) - gradient of (integral of q(z)*(log(p(x|z)) + log(p(z)))dz)<br />
= (integral of (gradient of q(z)*log(q(z)))dz) - (integral of (gradient of q(z))*(log(p(x|z)) + log(p(z)))dz)<br />
<br />
where I've shortened q(z; lambda) to q(z) and the gradient is wrt lambda.<br />
<br />
When I first saw this equation, I wasn’t sure what the big deal was. This is because I wasn’t accustomed to staring at math equations all day, but to any (applied) mathematician it should be obvious. I’ll put it in bold font: <b>computing the gradient of the ELBO requires computing some likely high-dimensional, likely intractable integral.</b> Why is it intractable? Well, you may get lucky and the integral may have an analytic solution, but in general that won’t be true. Also, because this quantity no longer takes the form of an expectation, it can’t easily be estimated by Monte Carlo. It may be possible to do so if z is discrete and its sample space is small (as you could exhaustively evaluate the integrand at all z), but that implies z is very low dimensional, which may not be true either.<br />
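To see the contrast, note that the ELBO itself <i>is</i> an expectation under q, so the bound (though not its gradient) can be estimated by simple Monte Carlo. Here is a toy sketch of mine; the conjugate Gaussian model p(z) = N(0, 1), p(x|z) = N(z, 1) is my own choice, picked because its posterior N(0.5, 0.5) and marginal likelihood are known exactly:

```python
import math, random

random.seed(0)

# Toy conjugate model: p(z) = N(0, 1), p(x | z) = N(z, 1), observed x = 1.0.
# Its posterior is exactly N(0.5, 0.5) and log p(x) = log N(1; 0, 2),
# which lets us check the Monte Carlo estimate.
x = 1.0

def log_normal(v, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)

def elbo(mu, var, n_samples=5000):
    """Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z)]."""
    total = 0.0
    for _ in range(n_samples):
        z = random.gauss(mu, math.sqrt(var))
        total += (log_normal(x, z, 1.0)        # log p(x | z)
                  + log_normal(z, 0.0, 1.0)    # log p(z)
                  - log_normal(z, mu, var))    # log q(z; lambda)
    return total / n_samples

log_px = log_normal(x, 0.0, 2.0)  # the constant the bound can never exceed
elbo_opt = elbo(0.5, 0.5)         # q(z) set to the exact posterior: bound is tight
elbo_bad = elbo(3.0, 1.0)         # a poor q(z): bound is loose
print(log_px, elbo_opt, elbo_bad)
```

With q(z) set to the exact posterior, every sample of log p(x, z) - log q(z) equals log p(x), so the bound is tight; with a poor q(z) the estimate sits below log p(x) by exactly the KL gap.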
<br />
Luckily there exist some methods to side-step these issues by placing mild(ish) constraints on either the variational family of distributions or both the variational family and the model. These methods are the score function gradient and the reparameterization gradient, but I won’t be discussing them in this post; they’re best left to be explained another day (or by Google).<br />
<br />
<hr />
<b>Expectation Maximization vs Variational Bayes</b> (2017-09-07)<br />
<br />
I constantly find myself forgetting the details of the EM algorithm, variational bayes, and what exactly the difference is between the two. To avoid confusion in the future, I wrote the following note.<br />
<br />
<b>Q:</b> What is the difference between EM and Variational Bayes (aka "Variational Bayesian EM")?<br />
<b>A:</b> Both are iterative methods for learning about the parameters theta and the posterior p(z | x, theta) in a latent variable model p(x, z | theta) by trying to maximize a lower bound to the marginal likelihood p(x) (or p(x | theta) when using EM). The difference between the two techniques lies in the following:<br />
<br />
<ol>
<li>VB assumes that, even when given a setting of the parameters theta, it is not tractable to compute the posterior p(z | x, theta). Thus, it must be approximated by some variational distribution q(z). On the other hand, EM makes the opposite assumption - given a setting of the parameters theta, it will compute the exact posterior p(z | x, theta). In this sense, VB is a relaxation of EM.</li>
<li>VB is a bayesian technique while EM is a frequentist technique. As such, VB treats theta as a latent, random variable and computes a distribution q(theta) that approximates the posterior distribution p(theta | x). EM, a maximum likelihood technique, computes a point estimate to the parameters theta instead of a full distribution.</li>
</ol>
With these differences in mind, we can describe the two algorithms using a similar framework.<br />
<br />
<u>Objective:</u><br />
EM’s objective is to maximize a lower bound to log p(x | theta), namely E_q(z)[log(p(x, z | theta)) - log(q(z))].<br />
VB’s objective is to maximize a lower bound to log p(x), namely E_q(z, theta)[log(p(x, z, theta)) - log(q(z, theta))].<br />
<br />
<u>E-step:</u><br />
EM first maximizes the objective by fixing theta and then finding an <i>optimal</i> setting of q(z). This optimal setting is always calculated to be the posterior p(z | x, theta).<br />
VB first maximizes the objective by fixing q(theta) and finding a <i>better</i> setting of q(z) (often by some gradient method).<br />
<br />
<u>M-step:</u><br />
EM maximizes the objective by fixing q(z) to the posterior from the previous E-step, and then calculates a maximum likelihood point estimate of the parameters theta.<br />
VB maximizes the objective by fixing q(z) and then finding a better setting of q(theta) (again, often by some gradient method).<br />
<br />
The E and M step are then repeated until convergence (i.e. until the point at which the log marginal likelihood no longer improves).<br />
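As a concrete instance of this framework on the EM side, here is a minimal sketch of mine (the two-component Gaussian mixture with known unit variances and equal weights is my own toy choice; theta is just the two means):

```python
import math, random

random.seed(1)
# Synthetic data: a mixture of N(-5, 1) and N(5, 1), equal weights.
data = ([random.gauss(-5.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])

def normal_pdf(v, mean):
    return math.exp(-0.5 * (v - mean) ** 2) / math.sqrt(2.0 * math.pi)

mu = [-1.0, 1.0]  # initial point estimate of theta = (mu_1, mu_2)
for _ in range(30):
    # E-step: set q(z) to the exact posterior p(z | x, theta), which is
    # tractable here -- exactly the assumption that makes plain EM applicable.
    resp = [(normal_pdf(v, mu[0]), normal_pdf(v, mu[1])) for v in data]
    resp = [(p0 / (p0 + p1), p1 / (p0 + p1)) for p0, p1 in resp]
    # M-step: with q(z) fixed, compute the maximum likelihood point
    # estimate of theta (no distribution over theta, unlike VB).
    for k in range(2):
        total = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * v for r, v in zip(resp, data)) / total

centers = sorted(mu)
print(centers)
```

The E-step is exact because the posterior over the component indicator z has a closed form, and the M-step is a plain maximum likelihood update; no distribution over theta is ever computed, which is what makes this EM rather than VB.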
<br />
One last point: there is another algorithm called "Variational EM". This algorithm is just the E-step of VB combined with the M-step of EM. Since parameters are not treated as random variables, it is not a bayesian technique.<br />
<br />
<hr />
<b>Frequentists vs Bayesians</b> (2016-05-02)<br />
<br />
A really wonderful aspect of learning about machine learning is that you can't help but learn about the field of statistics as well. As a computer scientist (or really, a software engineer -- I have a hard time calling myself a computer scientist), one of the joys of the field is that many interesting problems require expertise from other fields, and so computer scientists tend to be exposed to a great many ideas (even if only in a shallow sense)!<div>
<br /></div>
<div>
One of the major divides among statisticians is in the philosophy of probability. On one hand there are frequentists who like to place probabilities only on data; there is some underlying process that the data describes, but one doesn't have knowledge of all data, so there only exists uncertainty about what unknown data the process could generate. On the other hand there are bayesians who, like the frequentists, place probabilities on data for the same reason, but who also place probabilities on the models they use to describe the underlying process and on the parameters of these models. They allow themselves to do this because they concede that, although there does exist one true model that could perfectly describe the underlying process, as the modeler they have their own uncertainty about what that model may be.</div>
<div>
<br /></div>
<div>
I think this is an interesting divide because it highlights two different sources of uncertainty. Some uncertainty exists because there exists true randomness in the world (or so physicists believe), while other uncertainty exists solely due to our own ignorance -- the quantity in question may be completely determined! We are just forced to work with incomplete information. I think it is fascinating that there exists a unified framework that allows us to manage both sources of uncertainty, but it does make me wonder: do these two sources of uncertainty warrant their own frameworks?</div>
<hr />
<b>An intuition of Newton's method</b> (2016-04-02)<br />
<br />
During my lazy weekend afternoons (and all the other days of the week) I've been going through Nando de Freitas' undergraduate machine learning course <a href="https://www.youtube.com/playlist?list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf" target="_blank">on youtube</a>. In lecture 26 he introduces gradient descent, an iterative algorithm for optimizing (potentially non-convex) functions. I'll refrain from explaining gradient descent more than that as there are many good explanations on the internet (including Professor de Freitas'), but I do want to discuss gradient descent's learning rate parameter and why intuitively Newton's method, also introduced in lecture 26, is able to side-step the learning rate.<br />
<br />
As described in the video lecture, a common difficulty with gradient descent is picking the learning rate hyper-parameter. Pick too small of a learning rate and gradient descent will slow to a crawl, making very little progress, or even eventually stop making progress due to floating point underflow. Pick too large of a learning rate and gradient descent will oscillate around the minimum/maximum, bouncing around, very slowly making progress. In an ideal world, as the gradient vanishes, the gradient descent algorithm would be able to compensate by adjusting the learning rate so as to avoid both underflow and oscillation. This is exactly what Newton's method does.<br />
<br />
Compare gradient descent<br />
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - \eta_{k}\nabla f(\boldsymbol\theta_{k})$$<br />
to Newton's method<br />
$$\boldsymbol\theta _{k+1} = \boldsymbol\theta _{k} - H_{k}^{-1}\nabla f(\boldsymbol\theta_{k})$$<br />
<br />
The only difference is that Newton's method has replaced the learning rate with the inverse of the <a href="https://en.wikipedia.org/wiki/Hessian_matrix" target="_blank">Hessian matrix</a>. The lecture derived Newton's method by showing that \(f(\boldsymbol\theta)\) can be approximated by a second-order Taylor series expansion around \(\boldsymbol\theta_{k}\). If you're familiar with Taylor series, then this explanation may be sufficient, but I prefer to think about it differently.<br />
<br />
What is this Hessian matrix we've replaced the learning rate with? Just as the gradient vector is the multi-dimensional equivalent of the derivative, the Hessian matrix is the multi-dimensional equivalent of the second-derivative. If \(\boldsymbol\theta_{k}\) is our position on the surface of the function we're optimizing at time step \(k\), then the gradient vector, \(\nabla f(\boldsymbol\theta_{k})\), is the velocity at which we're descending towards the optimum, and the Hessian matrix, \(H\), is our acceleration.<br />
<br />
Let's think back to the original problem of choosing the learning rate hyper-parameter. It is often too small, causing underflow, or too big, causing oscillation, and in an ideal world we could pick just the right learning rate at any time step, most likely in relation to the rate at which the gradient vector vanishes, to avoid both these problems. But wait, now we have this thing called the Hessian matrix that measures the rate at which the gradient vanishes!<br />
<br />
When starting the algorithm's descent we will quickly gain acceleration as we start going downhill along the function's surface (i.e. the Hessian matrix will "increase"). We don't want to gain too much velocity or else we'll overshoot the minimum and start oscillating, so we'd want to choose a smaller learning rate. As the algorithm nears the optimum, the geometry of the function will flatten out and we will eventually lose acceleration. We don't want to lose too much velocity or else we'll underflow and stop making progress. It is precisely when we lose acceleration (i.e. the Hessian matrix "decreases") that we want to increase our learning parameter.<br />
<br />
So we have observed that there is an inverse relationship between the magnitude of the acceleration at which we descend towards the minimum and the magnitude of the most desirable learning parameter. That is to say, at greater accelerations we want smaller learning rates, and at smaller accelerations we want larger learning rates. It is precisely for this reason that Newton's method multiplies the gradient vector (the "velocity") by the inverse Hessian matrix (the "inverse acceleration")! At least, this is how I intuitively think about Newton's method. The derivation involving Taylor series expansions is probably the <i>actual</i> precise reason, but I like thinking about it in terms of velocities and accelerations.<br />
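This intuition is easy to see numerically. A 1-D sketch of mine (the function f(θ) = θ⁴ + θ², with its minimum at 0, and the step counts are my own arbitrary choices):

```python
def f_prime(t):    # first derivative of f(t) = t**4 + t**2 (minimum at t = 0)
    return 4.0 * t ** 3 + 2.0 * t

def f_second(t):   # second derivative: the 1-D "Hessian"
    return 12.0 * t ** 2 + 2.0

def gradient_descent(t, eta, steps):
    for _ in range(steps):
        t -= eta * f_prime(t)           # fixed learning rate eta
    return t

def newton(t, steps):
    for _ in range(steps):
        t -= f_prime(t) / f_second(t)   # learning rate replaced by 1 / f''(t)
    return t

gd = gradient_descent(2.0, 0.01, 6)
nt = newton(2.0, 6)
print(gd, nt)
```

After the same six steps, Newton's method is essentially at the minimum while fixed-rate gradient descent is still far away, because dividing by f'' automatically takes small steps where the function is steeply curved and large steps where it is flat.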
<br />
Finally, I'd like to end with one cautionary note. I'm an amateur mathematician. I like to think my intuition is correct, but it may not be. By sharing my intuition, I hope to help others who are looking to gain intuition, but if you need anything <i>more</i> than an intuitional understanding of Newton's method or gradient descent (because perhaps you're an actual mathematician in training), please consult a textbook or your favourite professor :-).<br />
<br />
<br />
<hr />
<b>On optimizing high-dimensional non-convex functions</b> (2015-12-09)<br />
<br />
Excuse me for the (in)completeness of this post. What follows is merely a thought, inspired by two independent statements, about a domain of science (or math, really) with which I am barely initiated. Let me give you these two statements first.<br />
<br />
In the book <a href="http://www.cs.cornell.edu/jeh/bookMay2015.pdf" target="_blank">Foundations of Data Science</a>, the very first paragraph of the book says<br />
<blockquote class="tr_bq">
<span style="font-family: "cmr12"; font-size: 12pt;">If one generates </span><span style="font-family: "cmmi12"; font-size: 12pt;"><i>n</i> </span><span style="font-family: "cmr12"; font-size: 12pt;">points at random in the unit </span><span style="font-family: "cmmi12"; font-size: 12pt;"><i>d</i></span><span style="font-family: "cmr12"; font-size: 12pt;">-dimensional ball, for sufficiently large </span><span style="font-family: "cmmi12"; font-size: 12pt;"><i>d</i>, </span><span style="font-family: "cmr12"; font-size: 12pt;">with
high probability the distances between all pairs of points will be essentially the same.
Also the volume of the unit ball in </span><span style="font-family: "cmmi12"; font-size: 12pt;"><i>d</i></span><span style="font-family: "cmr12"; font-size: 12pt;">-dimensions goes to zero as the dimension goes to
infinity. </span></blockquote>
That is statement one. Statement two is an assertion given in this year's <a href="http://www.iro.umontreal.ca/~bengioy/talks/DL-Tutorial-NIPS2015.pdf" target="_blank">Deep Learning Tutorial</a> at the Neural Information Processing Systems (NIPS) conference. They claim that "most local minima are close to the bottom (global minimum error)."<br />
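The first statement is easy to check numerically. A quick sketch of mine (I use standard Gaussian points as a stand-in for points in the unit ball; the concentration phenomenon is the same):

```python
import math, random

random.seed(0)

def pairwise_spread(d, n=50):
    """Max/min ratio of pairwise distances for n random Gaussian points in R^d."""
    pts = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return max(dists) / min(dists)

# In low dimension the spread is large; in high dimension all pairwise
# distances concentrate around the same value, so the ratio approaches 1.
spread_low = pairwise_spread(2)
spread_high = pairwise_spread(2000)
print(spread_low, spread_high)
```

In 2 dimensions the max/min distance ratio is large; in 2000 dimensions it hovers just above 1, exactly the concentration the book describes.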
<br />
The second statement is significant for training (deep) neural networks. Training a neural network always uses a variant of gradient descent, a method for minimizing the neural network's error by iteratively moving towards a point where the gradient (i.e. the derivative) of the error function is 0. Ideally gradient descent will approach the global minimum error (the point where the neural network will give the most desirable approximations of the training data), but that is only guaranteed when the error function is convex (there are no hills when the function is plotted). Unfortunately, a neural network almost never has a convex error function. Hence, during gradient descent, one might find a local minimum that is nowhere near as low as the global minimum. The second statement, however, says we shouldn't be so concerned -- all local minima will be close to the global minimum!<br />
<br />
But why?<br />
<br />
My thought is that somehow the first statement can be used to prove the second statement. The training data of modern neural networks is high dimensional. It will also often be normalized as part of feature scaling. Given those two properties, the high-dimensional hills and valleys of a neural network's error function then (roughly) form unit hemispheres. This (possibly) implies that the local minima are close to the global minimum because the volume of every valley nears zero, making all the valleys increasingly similar to each other as the dimensionality of the training data increases.<br />
<br />
I want to stress though, this is a half-baked idea. It's probably rubbish. It might also be redundant! Perhaps the second statement already has a proof and I've missed it. Either way, I would love to see if others find my intuition plausible or, even better, to be pointed in the direction of a proof for the second statement.<br />
<br />
<hr />
<b>TIL: premultiplied-alpha colors</b> (2015-10-03)<br />
<br />
Alpha is the measure of how translucent an object is. An alpha of 0.0 means the object is entirely transparent, an alpha of 1.0 means the object is entirely opaque, and an alpha in the middle means a fraction of the total light may pass through the object. Traditionally, a color is represented by 4 constituent components: a red contribution, a green contribution, a blue contribution, and the alpha. When compositing two colors together, one on top of the other, the alpha acts as a mixing weight, indicating how much of the top color and how much of the bottom color contribute to the new composited color. The traditional compositing operation is as follows, where <i>A</i> is being composited over top <i>B</i>:<br />
<br />
<center style="text-align: center;">
<img src="http://latex.codecogs.com/gif.latex?\inline&space;\\&space;A&space;=&space;\begin{bmatrix}&space;r_{a}&space;&&space;g_{a}&space;&&space;b_{a}&space;&&space;a_{a}&space;\end{bmatrix}&space;\\&space;B&space;=&space;\begin{bmatrix}&space;r_{b}&space;&&space;g_{b}&space;&&space;b_{b}&space;&&space;a_{b}&space;\end{bmatrix}&space;\\&space;C_{rgb}&space;=&space;a_{a}\cdot&space;A_{rgb}&space;+&space;(1&space;-&space;a_{a})\cdot&space;B_{rgb}" title="\\ A = \begin{bmatrix} r_{a} & g_{a} & b_{a} & a_{a} \end{bmatrix} \\ B = \begin{bmatrix} r_{b} & g_{b} & b_{b} & a_{b} \end{bmatrix} \\ C_{rgb} = a_{a}\cdot A_{rgb} + (1 - a_{a})\cdot B_{rgb}" /></center>
<br />
Alternatively, we may wish to premultiply the red, green, and blue components by the alpha:
<br />
<br />
<center>
<img src="http://latex.codecogs.com/gif.latex?C&space;=&space;\begin{bmatrix}&space;a\cdot&space;r&space;&&space;a\cdot&space;g&space;&&space;a\cdot&space;b&space;&&space;a\end{bmatrix}" title="C = \begin{bmatrix} a\cdot r & a\cdot g & a\cdot b & a\end{bmatrix}" /></center>
<center style="text-align: left;">
</center>
<center style="text-align: left;">
</center>
<center style="text-align: left;">
<br /></center>
<center style="text-align: left;">
With this representation we get a new compositing equation:</center>
<center style="text-align: left;">
<br /></center>
<center style="text-align: left;">
</center>
<center style="text-align: left;">
</center>
<center>
<img src="http://latex.codecogs.com/gif.latex?C&space;=&space;A&space;+&space;(1&space;-&space;a_{A})\cdot&space;B" title="C = A + (1 - a_{A})\cdot B" /></center>
<center style="text-align: left;">
</center>
<center style="text-align: left;">
<br /></center>
<center style="text-align: left;">
This new form is interesting for a couple reasons.</center>
<center style="text-align: left;">
<ol>
<li>It is computationally more efficient. It requires one less vector multiplication.</li>
<li>It is a closed form. Compositing a premultiplied-alpha color over top a premultiplied-alpha color yields another premultiplied-alpha color. The same cannot be said of non-premultiplied-alpha colors. Compositing two non-premultiplied-alpha colors yields, interestingly, a premultiplied-alpha color.</li>
<li>When filtering (<i>aka</i> downsampling), it produces more visually accurate results. A <a href="http://15462.courses.cs.cmu.edu/fall2015/lecture/pipeline/slide_031" target="_blank">picture</a> is worth a thousand words.</li>
</ol>
<div>
And that's it. Premultiplied-alpha colors are nifty.</div>
</center>
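The two compositing operations above can be sketched in a few lines (my own toy illustration, treating colors as [r, g, b, a] lists):

```python
def premultiply(c):
    """Convert a straight-alpha color [r, g, b, a] to premultiplied form."""
    r, g, b, a = c
    return [a * r, a * g, a * b, a]

def over_straight_rgb(a, b):
    """Traditional blend of the rgb channels: a_A * A_rgb + (1 - a_A) * B_rgb."""
    return [a[3] * ca + (1.0 - a[3]) * cb for ca, cb in zip(a[:3], b[:3])]

def over_premultiplied(a, b):
    """C = A + (1 - a_A) * B, both colors premultiplied. One multiply fewer,
    and the result is itself premultiplied (the representation is closed)."""
    return [ca + (1.0 - a[3]) * cb for ca, cb in zip(a, b)]

red = [1.0, 0.0, 0.0, 0.5]   # straight-alpha, half-transparent red
blue = [0.0, 0.0, 1.0, 1.0]  # opaque blue background
straight = over_straight_rgb(red, blue)
premult = over_premultiplied(premultiply(red), premultiply(blue))
print(straight, premult)
```

Over an opaque background the two paths produce the same rgb; the premultiplied path also saves the per-channel multiply and returns a color that is already premultiplied, demonstrating the closure property from point 2.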
<hr />
<b>TIL: The column space and null space</b> (2015-10-02)<br />
<br />
A vector space is a set of vectors that is closed under addition and scalar multiplication. For example, given two vectors, <b>v</b> and <b>w</b>, a vector space is formed by the set of all linear combinations of <b>v</b> and <b>w</b>, namely <i>c</i><span style="font-weight: bold;">v</span> + <i>d</i><b>w</b> for arbitrary coefficients <i>c</i> and <i>d.</i><br />
<i><br /></i>
Column spaces and null spaces are special categories of vector spaces that have interesting properties related to systems of linear equations, <i>A</i><b>x</b> = <b>b</b>. The column space of the matrix <i>A</i>, C(<i>A</i>), is simply the set of all linear combinations of <i>A</i>'s column vectors. This implies that <i>A</i><b>x</b> = <b>b</b> may only be solved when <b>b</b> is a vector in <i>A</i>'s column space. Finally, the null space of matrix <i>A</i> is another vector space, formed by all the solutions to <i>A</i><b>x</b> = <b>0</b>.<br />
<br />
<hr />
<b>TIL: a principled approach to dynamic programming</b> (2015-10-01)<br />
<br />
Dynamic programming has always been a topic I understood at a surface level (<i>it's just memoization, right?!</i>), but ultimately feared for lack of real-world experience solving such problems. I read today a superb <a href="http://sites.fas.harvard.edu/~cs124/sections/section7_sol.pdf" target="_blank">explanation</a> of a principled approach to solving dynamic programming problems. Here it is:<br />
<br />
<ol>
<li>Try to define the problem at hand in terms of composable sub-problems. In other words, ask yourself what information would make solving this problem easier?</li>
<li>Define a more rigorous recurrence relation between the sub-problems and the current problem. What are the base-cases and how do the sub-problems answer the current problem?</li>
<li>Build a solution lookup table (up to the point of the current problem) by first initializing it for the base cases and then filling it in for the sub-problems in a bottom-up approach. The bottom-up approach is the defining characteristic of a dynamic programming solution; without it, the lookup table is just memoization. <i>Note</i>: building the solution table bottom-up will often look like a post-order depth-first search.</li>
</ol>
<div>
<a href="https://gist.github.com/jhaberstro/c4fd6fdf8d20f9f4bfed" target="_blank">Here is an implementation (in Scala)</a> of the following problem:</div>
<blockquote class="tr_bq">
Given a general monetary system with M different coins of value {c1, c2, . . . , cM},
devise an algorithm that can make change for amount N using a minimum number of coins. What is the
complexity of your algorithm?</blockquote>
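Applying the three steps above to this problem gives the classic bottom-up solution. Here is a version of mine in Python (the linked gist is in Scala) that runs in O(M·N) time and O(N) space:

```python
def min_coins(coins, n):
    """Minimum number of coins drawn from `coins` that sum to n (None if impossible)."""
    INF = float("inf")
    # Steps 1-2: sub-problem best[a] = fewest coins summing to amount a,
    # with recurrence best[a] = 1 + min(best[a - c]) over coins c <= a.
    best = [0] + [INF] * n  # base case: amount 0 needs 0 coins
    # Step 3: fill the lookup table bottom-up, smallest amounts first.
    for amount in range(1, n + 1):
        for c in coins:
            if c <= amount and best[amount - c] + 1 < best[amount]:
                best[amount] = best[amount - c] + 1
    return best[n] if best[n] != INF else None
```

For example, with US coins {1, 5, 10, 25}, making change for 30 takes two coins (25 + 5).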
<hr />
<b>TIL: Alpha-to-coverage for order-independent transparency</b> (2015-10-01)<br />
<br />
Alpha-to-coverage is a computer graphics technique, supported by most (all?) modern GPUs, for rendering translucent multi-sampled anti-aliased primitives. Given an alpha value in the range of [0.0, 1.0] and <i>N</i> samples of color stored per pixel, the alpha channel will be discretized into a coverage-mask of the <i>N</i> samples. An alpha of 0 will generate an empty coverage-mask (no samples take on the new color), an alpha of 1 will generate a full coverage-mask (all samples take on the new color), and values in between will generate a partial coverage-mask (between 1 and <i>N</i>-1 samples take on the new color).<br />
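A simplified sketch of that discretization (my own illustration; real hardware typically uses a dithered, spatially varying mask rather than this naive fill-the-low-bits pattern):

```python
def alpha_to_coverage(alpha, n_samples=4):
    """Discretize alpha in [0.0, 1.0] into an n_samples-bit coverage mask."""
    covered = round(alpha * n_samples)  # how many samples take on the new color
    return (1 << covered) - 1           # e.g. 2 of 4 samples -> mask 0b0011
```

So alpha 0.0 covers no samples, 1.0 covers all four, and 0.5 covers two of the four.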
<div>
<br /></div>
<div>
It's not alpha-to-coverage itself that I learned today, however, but rather its implication: <b>Alpha-to-coverage is a form of order-independent transparency!</b> In the naïve case, one benefits from <i>N</i> layers of transparency. I suppose this is the whole point of alpha-to-coverage, but I never put two-and-two together*.</div>
<div>
<br /></div>
<div>
<br />
<span style="font-size: x-small;">* I blame my experience. Being a GPU driver engineer, but never a rendering engineer, I'm exposed to a large cross-section of rendering techniques and low-level graphics implementation details, but never to the bigger picture of real-time rendering except by word of mouth and self-study.</span></div>
<hr />
<b>TIL: Today I learned</b> (2015-10-01)<br />
<br />
It has been a long time since my last appearance. I've written a couple of unfinished posts in that time, but I've realized that writing a detailed essay that's up to a satisfactory standard of quality is actually quite time consuming (<i>who would have thought!</i>). In an attempt at more productive writing, I will try to publish one post per day expounding on something new I've learned that day. Nothing more than a paragraph or two. My hope is that this will be burden-free enough not to require extraneous effort, yet valuable enough that it's worth recording. Equally, this will serve as a nice personal journal of my educational journey post-schooling.<br />
<br />
<hr />
<b>Encapsulation == Parameterization</b> (2014-05-21)<br />
<br />
Martin Odersky recently made the interesting observation that encapsulation is equivalent to parameterization. He demonstrated this by showing how abstract types are used (in their natural fashion) for encapsulation in Scala, and then equated parameterized types (aka generic classes) to corresponding types where the original type parameters had been transformed into abstract type members.<br />
<br />
Intuitively this observation makes sense. Encapsulation's goal is to <i>hide</i> information: don't let me muck with the internals of an implementation. Parameterization's goal, on the contrary, is to defer, or put another way <i>abstract</i>, implementation details (i.e. abstract what types or values make up the data structure or algorithm that I'm building). Logically, the more parameters we add, the less we can know. That is to say, by restricting our knowledge we've abstracted, or <i>hidden</i>, implementation details from ourselves! (albeit in a somewhat inverse manner than traditional encapsulation, but that's beside the point)<br />
<br />
As we can see clearly now, encapsulation and parameterization are equivalent; both hide implementations and both abstract interfaces. Put another way, hiding implementations is equivalent to abstracting interfaces.<br />
<br />
<hr />
<b>"Is President Obama Really A Socialist? Let's Analyze Obamanomics" -- Reactions</b> (2013-09-18)<br />
<br />
I wrote this in response to a post of a Facebook friend the other night. It was never meant to be more than some casual thoughts, and still really, it's only that. However, seeing as I'm awful at staying current on my blog, I thought I might as well not let this digital hunk of "ideas to paper" go to complete waste. So, here it is, a reaction to a silly political Forbes article, (almost) completely unedited.<br />
<br />
<hr />
<br />
There is a lot wrong (or perhaps misguided, in my opinion) in <a href="http://www.forbes.com/sites/peterferrara/2012/12/20/is-president-obama-really-a-socialist-lets-analyze-obamanomics/" target="_blank">this article</a>. It's hard to know where to start. However, let me start by saying that what follows is simply to point out the holes in Ferrara's arguments from a factual standpoint, as they are presented; in no way am I asserting that if we accept what he says to be wrong, then the logical implication is that Obama's solutions are automatically right. In fact, I'm not going to talk about Obama's policies at all. So, let's begin.<br />
<blockquote class="tr_bq">
<i>the top 1% of income earners paid 39% of all federal income taxes, three times their share of income at 13%. Yet, the middle 20% of income earners, the true middle class, paid just 2.7% of total federal income taxes on net that year, while earning 15% of income. </i></blockquote>
His line of reasoning here only works if you agree that every dollar has equal worth (which it doesn't[1]). In that case, what he puts forward does suggest that there is a flaw *somewhere* in the system; I don't think anyone is denying that. However, what is more startling are the ratios between percentage of income and percentage of income earners. The difference between (13/1) and (15/20) means that, per person, the top 1% is making roughly 1700% more than the middle 20%! And that's only with the conservative bounds that Ferrara has chosen. That gap will certainly widen dramatically as you expand the bounds to include the lower class and the upper class (the 1% is what I would call the "elite upper class").<br />
<blockquote class="tr_bq">
<i>Any normal person would say that such an income tax system is more than fair, or maybe that “the rich” pay more than their fair share.</i></blockquote>
Yes, perhaps so, but "any normal person" isn't qualified to make these judgements (we like to think we are, though). Economics, like any other science, is non-obvious; that's why we have economists and trained professionals. Anyway, the point is: the "any normal person would..." card is a straw-man argument.<br />
<blockquote class="tr_bq">
<i>The answer is that to President Obama this is still not fair because he is a Marxist. To a Marxist, the fact that the top 1% earn more income than the bottom 99% is not fair, no matter how they earn it, fairly or not. So it is not fair unless more is taken from the top 1% until they are left only with what they “need,” as in any true communist system. Paying anything less is not their “fair” share. __That is the only logical explanation of President Obama’s rhetoric__, and it is 100% consistent with his own published background.</i></blockquote>
No, it's not. Like most things in this world, there isn't a binary right-or-wrong, yes-or-no, "this way or that way" answer to everything (or most things, for that matter). Following that, his jump to "Obama is a Marxist" is a long leap to make and an equally ignorant one (I actually doubt this man is ignorant; rather, he knows his readers well).<br />
<blockquote class="tr_bq">
<i>Good tax policy is not guided by “need.” It is guided by what is needed to establish the incentives to maximize economic growth.</i></blockquote>
Says who? (no, seriously) There's a lot of debate about this, but I hardly think what he suggests here is a "good thing"(™). A tax policy that optimizes only for "economic growth" doesn't speak at all to economic distribution. In my book (okay, I'm interjecting an opinion here), a good tax policy would ensure the largest number of people are living comfortably, with small but noticeable degrees of variance -- the variance allowing for the opportunity to achieve greater economic success (i.e. to ensure that the motivation to "rise through the ranks" still exists, a key component of the "American Dream").<br />
<blockquote class="tr_bq">
<i>The middle class, working people and the poor are benefited far more by economic growth than by redistribution. That is shown by the entire 20th century, where the standard of living of American workers increased by more than 7 times, through sustained, rapid economic growth.</i></blockquote>
This is blatantly false. Our initial economic booms were due to labor exploitation (child labor, poor working conditions, no minimum wage, etc.) and the military-industrial complex. Secondly, if you look at the tax rates through the 50s and 60s, some of the arguably better economic years of the 20th century, you will see that what Ferrara is arguing about tax is completely contrary to the actual statistics (which, at the very least, suggests that there is no causative relationship between tax rates and economic health, and at best suggests that high taxes that essentially scale with your income are causative to the economy's health. In reality, it suggests something in between). For example, take these statistics of incomes adjusted for inflation vs. marginal tax rates[2]: in 1960, a single person making $1,163,483 was taxed at a rate of 90% and a head of household making $100,000 was taxed at 36%.<br />
<blockquote class="tr_bq">
<i>But President Obama’s tax policy of increasing all tax rates on savings and investment will work exactly contrary to such economic growth. It is savings and investment which creates jobs and increases productivity and wages. Under capitalism, capital and labor are complementary</i></blockquote>
These are unsubstantiated assertions. He gives no reason why I should believe what he says.<br />
<br />
Anyways, I must admit, I don't have the energy to persist in my writing. I'm tired and still have homework to do (which involves much, much denser, drier reading than that of Ferrara's… :)). I'll just conclude by saying that although there are some claims that Ferrara makes that seem to be substantiated (however, I'm skeptical that real substantiation can be provided in a Forbes article), the large majority of his claims aren't sufficiently (or at all) substantiated. For that reason alone, I would take everything that this man writes in such Forbes-like articles with a grain of salt.<br />
<br />
<br />
[1] Think about the value of $50 to someone making minimum wage versus a multi-millionaire or billionaire. If each were to lose a $50 bill, the decrease in so-called "opportunity cost" is hugely different between the two (i.e. a minimum-wage worker will "get a lot more" out of that $50 bill). Ideally, "fairness" needs to be judged on an adjusted scale of relative value.<br />
<br />
[2] http://taxfoundation.org/article/us-federal-individual-income-tax-rates-history-1913-2013-nominal-and-inflation-adjusted-bracketsAnonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0tag:blogger.com,1999:blog-3945075237501190911.post-85340529618294011212012-11-14T12:28:00.001-08:002012-11-14T12:30:49.749-08:00In Good CompanyRegarding my last post, I'm glad to see I'm <a href="https://plus.google.com/101960720994009339267/posts/hoJdanihKwb" target="_blank">in good company</a> in feeling that object-oriented programming is not the be-all, end-all to software design (nor is it always a good design). Now if only I could get my school's Software Engineering department to agree...<br />
<br />
<i>I should also note, I have a lot more to my opinion than what I crudely wrote in my last post. I just don't feel particularly motivated to expand upon them because religion is religion is religion and I doubt anyone reading this (if there is anyone reading this) will alter their views based on what I have to say.</i>Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0tag:blogger.com,1999:blog-3945075237501190911.post-68204972548269011462012-10-18T12:08:00.001-07:002012-10-18T12:08:52.286-07:00On the Harm of Object-Oriented ProgrammingThis is an excerpt of an essay that I wrote for a class, prompted by Dijkstra's famous goto letter. I must admit that it was written in a very "stream-of-conscious" type of way, so it may not be elegant writing, but I think it's worth putting down anyways.<br />
<br />
<blockquote class="tr_bq">
<span class="s1">I believe today that the reach of computer science (which I should probably refer to as software engineering, as the discussion of goto is more a practical matter than a theoretical one) is sufficiently widespread that it has divided the industries of programming so much that it is not useful or plausible to talk of a single language feature or programming practice being useful or harmful. What is useful for a web application could be considered harmful for an embedded software application, and vice-versa. So I would limit my response to the area with which I am most familiar: desktop application development.</span></blockquote>
<br />
<blockquote class="tr_bq">
<span class="s1"></span>In my opinion, one of the most harmful programming features is single-paradigm object-oriented programming. Object-oriented programming has been advocated, pushed, and taught so strongly for the past 20 years that it is almost universally used to model any problem we need to tackle as programmers. However, like any tool, it has strengths and weaknesses. At the heart of object-oriented programming is the idea of modeling every piece of the program as realistically as possible as a noun that we as humans can understand (which implies that we are attempting to draw a parallel between real-life objects and our programs). The consequence is that we group the pieces of data that compose that noun together in memory. This has significant downsides on modern desktop computers, though. The grouping of data that object-oriented programming encourages is essentially the memory layout referred to as "array of structures" (AOS). Consider an AOS layout for a list of particles being used to model an explosion effect in a video game. In this setup, you may have a geometric vector for the position, velocity, and force stored in each object. Presumably, since the particle has these attributes, you need to perform a numerical integration of them over time. This is a loop over your particles where at each iteration you access each attribute, perform the necessary computation, and store the result. In the AOS layout, a particle's attributes are stored contiguously in memory, and the particles themselves are stored one after another. That means on current desktop PCs, where memory access is expensive and the hardware has a lot of built-in mechanisms (caches, prefetchers, vector units) to reduce the cost, the loop walks a single interleaved stream in which every cache line mixes unrelated attributes together. On the contrary, let's reconsider the system we're building. We designed our program so that each particle is an object. But when will there only exist one particle? The answer is never. 
The program can be better redesigned to account for the fact that particles are only ever used en masse. The solution is to split each attribute of a particle into its own array (the so-called "structure of arrays" layout); now when iterating over the particle data, each attribute occupies its own densely packed stream of cache lines, significantly reducing the number of cache misses and increasing the performance of the application. Arguably you can still have a "ParticleManager" object, so all hope isn't lost for object-oriented programming; but the point is that single-paradigm object-oriented languages are so widely taught and used that the mindset we've learned leads us to design our particle system blindly, thinking about real-world abstraction instead of an abstraction that makes sense for both us and the computer. On a final note, I speak of <i>single paradigm</i> object-oriented programming because it is those languages that force us to think in terms of objects and don't allow us to escape the paradigm when it no longer makes sense. </blockquote>
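The redesign described above can be sketched concretely. This is an illustrative toy in Python (the names and values are hypothetical, and Python lists don't actually give the contiguous memory the performance argument depends on — in C or C++ these would be real contiguous buffers), showing the same integration loop in both layouts:

```python
# Array-of-structures (AOS): each particle groups its own attributes together.
particles_aos = [
    {"pos": 0.0, "vel": 1.0, "force": 0.5},
    {"pos": 2.0, "vel": -1.0, "force": 0.0},
]

def integrate_aos(particles, dt, mass=1.0):
    for p in particles:                      # one interleaved stream of records
        p["vel"] += (p["force"] / mass) * dt
        p["pos"] += p["vel"] * dt

# Structure-of-arrays (SOA): one densely packed array per attribute.
pos = [0.0, 2.0]
vel = [1.0, -1.0]
force = [0.5, 0.0]

def integrate_soa(pos, vel, force, dt, mass=1.0):
    for i in range(len(pos)):                # one dense stream per attribute
        vel[i] += (force[i] / mass) * dt
        pos[i] += vel[i] * dt

integrate_aos(particles_aos, dt=0.1)
integrate_soa(pos, vel, force, dt=0.1)

# Both layouts compute identical results; only the memory layout differs.
assert abs(particles_aos[0]["pos"] - pos[0]) < 1e-12
```

The arithmetic is deliberately identical in both functions; the entire difference is where the bytes live, which is exactly the point of the AOS-vs-SOA argument.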
Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0tag:blogger.com,1999:blog-3945075237501190911.post-536197481522299802012-05-27T17:46:00.000-07:002012-05-27T17:46:19.843-07:00Animating NSSplitViewA couple of days ago I set out to come up with an emulation of Mail.app's "Mail Activity" panel. For the most part, it was a straightforward implementation. Create a horizontal NSSplitView in IB, set the top view to my source list (or whatever else you want -- it doesn't matter), set the bottom view to a custom view whose drawRect: method has been overridden to draw a background color matching the source list background color.<br />
<br />
Okay, that's fine and dandy. All that's left is to implement the "Show/Hide" toggle functionality. There were two constraints I wanted to make sure I abided by:<br />
<ol>
<li>When toggling from hidden to shown, I want to restore the panel to the height it had when it was last shown.</li>
<li>I want the toggle to animate like in Mail.</li>
</ol>
Neither criterion is complicated, but neither gets any help from NSSplitView's built-in functionality either.<br />
<br />
<h3>
Restoring the previous height</h3>
I can't fathom why, but NSSplitView has a method to set a divider's position (setPosition:ofDividerAtIndex:), but doesn't have a method to query a divider's position. The implementation is trivial, but that just makes me wonder all the more why NSSplitView doesn't have it. Maybe I'm overlooking something, and my implementation is horribly broken; I'm not really sure! But let's assume and hope that I'm not too far off :). Here's my implementation:<br />
<br />
<pre class="prettyprint">- (CGFloat)positionForDividerAtIndex:(NSInteger)idx
{
    NSRect frame = [[[self subviews] objectAtIndex:idx] frame];
    if (self.isVertical) {
        return NSMaxX(frame) + ([self dividerThickness] * idx);
    }
    else {
        return NSMaxY(frame) + ([self dividerThickness] * idx);
    }
}
</pre>
<br />
This can be added in either a category or a subclass. Actually restoring the panel to its previous height is dead simple once you have this. Upon hiding, you call positionForDividerAtIndex: on the panel's parent split-view and save the value, and then upon unhiding you use the saved value to set the divider's new position.<br />
<br />
<h3>
Animating the panel</h3>
<div>
My initial reaction was to animate the frame of the panel view using NSView's animator proxy object. However, that just felt messy to me. There were too many unknowns in my mind -- how would changing the frame of a subview interact with the rest of NSSplitView? My guess is that I'd most likely need to animate all subviews of the NSSplitView and do some additional magic in the NSSplitView's delegate methods. I quickly dismissed this option.</div>
<div>
<br /></div>
<div>
My second thought was to somehow animate the split view's divider itself (which you might have guessed from the previous section on restoring a divider's position). This should correctly size the subviews and obey any constraints imposed by the NSSplitView's delegate. The problem is that Core Animation relies on key-value coding to animate arbitrary properties of a class. And although conceptually each divider's position kinda-sorta feels like a property, it isn't one. But that was my inspiration! </div>
<div>
<br /></div>
<div>
Before I get into the implementation, let me briefly describe how Core Animation can animate arbitrary properties. The first requirement for the class whose properties you want to animate is for the class to implement the NSAnimatablePropertyContainer protocol. To my knowledge, NSView and NSWindow are the only classes in AppKit that implement this protocol, and there isn't a way to implement this yourself without Apple's SPI. Part of NSAnimatablePropertyContainer is the animator method. It returns a proxy object responsible for initiating animations (the proxy object can also be treated and passed around just like the original object because it will respond to any method the original object responds to). The magic works by checking whether messages sent to the animator proxy correspond to a property's key path on the original object. If one does, the proxy checks its animationForKey: method for an animation associated with that property, and if it finds one, it performs the animation through private Core Animation machinery, issuing a setValue:forKey: call for each sample along the animation. If the sent message either doesn't have an associated animation or doesn't represent a proper key-value-compliant property, then the proxy will just forward the method to the original object.</div>
<div>
<br /></div>
<div>
So, to animate custom properties that don't have default animations (such as frame or position do) we have to create and associate an animation of our choosing with the property, and then send the appropriate method to the animator proxy object. Here's a quick example to check out.</div>
<br />
<pre class="prettyprint">@interface MyCustomView : NSView
@property (assign) float customAnimatableProperty;
@end

@implementation MyCustomView
@synthesize customAnimatableProperty;

- (void)testAnimation
{
    // Add a custom animation to the animations dictionary
    CABasicAnimation* animation = [CABasicAnimation animation];
    NSMutableDictionary* newAnimations = [NSMutableDictionary dictionary];
    [newAnimations addEntriesFromDictionary:[self animations]];
    [newAnimations setObject:animation forKey:@"customAnimatableProperty"];
    [self setAnimations:newAnimations];
    // Initiate the animation via the animator proxy
    [[self animator] setCustomAnimatableProperty:10.0f];
}
@end
</pre>
<br />
Okay, so that does it for my explanation of Core Animation. Now the kicker: how to animate something that isn't associated with a property (such as setPosition:ofDividerAtIndex:)? My solution? Create a faux key path! By overriding animationForKey:, valueForUndefinedKey: and setValue:forUndefinedKey: and creating setPosition:ofDividerAtIndex:animate: to initiate the animation, I can trick the view into thinking properties exist for each divider. The faux key path is a series of key paths, one for each divider (which I call "dividerPosition0", "dividerPosition1", "dividerPosition2", etc..) and a catch-all key "dividerPosition" that all dividers can use if the user doesn't provide an animation specific to that divider.<br />
<br />
setPosition:ofDividerAtIndex:animate: is straightforward. It calls the original setPosition:ofDividerAtIndex: if animate is false, else it calls setValue:forKey: on the animator proxy with the appropriate dividerPosition key.<br />
<br />
<pre class="prettyprint">- (void)setPosition:(CGFloat)position ofDividerAtIndex:(NSInteger)dividerIndex animate:(BOOL)animate
{
    if (!animate) {
        [super setPosition:position ofDividerAtIndex:dividerIndex];
    }
    else {
        // NSInteger is long on 64-bit, so format with %ld
        [[self animator] setValue:[NSNumber numberWithFloat:position]
                           forKey:[NSString stringWithFormat:@"dividerPosition%ld", (long)dividerIndex]];
    }
}
</pre>
<br />
The other three methods are also small, and all share the need to parse the key, check whether it is a valid "dividerPositionN" key, and if so extract the 'N' suffix as an integer. animationForKey: first checks whether an animation already exists for the key, and if one doesn't, it returns the animation registered for the catch-all "dividerPosition" key.<br />
<br />
<pre class="prettyprint">- (id)animationForKey:(NSString *)key
{
    id animation = [super animationForKey:key];
    NSInteger idx;
    if (animation == nil && [self _tryParsingDividerPositionIndex:&idx fromKey:key]) {
        animation = [super animationForKey:@"dividerPosition"];
    }
    return animation;
}
</pre>
<br />
And finally, valueForUndefinedKey: and setValue:forUndefinedKey: are just wrappers around positionForDividerAtIndex: and setPosition:ofDividerAtIndex:<br />
<br />
<pre class="prettyprint">- (id)valueForUndefinedKey:(NSString *)key
{
    NSInteger idx;
    if ([self _tryParsingDividerPositionIndex:&idx fromKey:key]) {
        CGFloat position = [self positionForDividerAtIndex:idx];
        return [NSNumber numberWithFloat:position];
    }
    return nil;
}

- (void)setValue:(id)value forUndefinedKey:(NSString *)key
{
    NSInteger idx;
    if ([value isKindOfClass:[NSNumber class]] && [self _tryParsingDividerPositionIndex:&idx fromKey:key]) {
        [super setPosition:[value floatValue] ofDividerAtIndex:idx];
    }
}
</pre>
<br />
And that's it! I stick this code in a NSSplitView subclass, and animate a divider by calling [mySplit setPosition:pos ofDividerAtIndex:divIndex animate:YES].<br />
<br />
I hope you've enjoyed this post and find the code useful. However, this code is young and not well-tested. If you come across any problems or have suggestions I'd love to hear them in the comments section. Until next time.. :)<br />
<br />
<script src="https://raw.github.com/gist/1522901/" type="text/javascript">
</script>Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com2tag:blogger.com,1999:blog-3945075237501190911.post-74361064645206159252012-03-21T22:00:00.001-07:002012-03-21T22:00:13.558-07:00Never do this:Rename libc++.1.dylib (in /usr/lib) to something other than libc++.1.dylib. I wanted to install a fresh svn build of libc++ on my home computer. I figured I wanted to keep the original dylib file around just in case anything went horribly wrong (TM). The irony? Renaming libc++.1.dylib basically made my whole machine fail. Internet stopped working, Finder stopped working, Spotlight stopped working, launchd crashed on reboot causing my computer to hang, and who knows what else was broken because of it.<br />
<br />
In hindsight, I guess it makes sense that renaming libc++.1.dylib would cause a lot of stuff to crash since libc++ is a dynamically loaded library. Though I always thought that when a dylib was loaded, it was somehow copied into the resident memory of the application. I guess not.<br />
<br />
It does make me wonder though... what IS the correct way to install libc++ if not to replace the file? libc++'s website suggests creating links in /usr/lib but goes on to say "either way should work" (without ever saying what that other way is...). Hum.<br />
<br />
And just to document what I did to fix it:<br />
<br />
<ol>
<li>Booted into single user mode (held command-s during reboot).</li>
<li>Saw a diagnostic message saying that stuff crashed because libc++.1.dylib was missing. This is when I realized that renaming libc++.1.dylib did have such catastrophic effects. </li>
<li>Rebooted into recovery mode (held command-r during reboot).</li>
<li>cd up several directories until I found "Volumes"</li>
<li>cd into Volumes/Macintosh\ HD/usr/lib</li>
<li>renamed the file to libc++.1.dylib (using the 'mv' command).</li>
<li>Rebooted normally, and prayed that this would fix it</li>
<li>And it did!</li>
</ol>
Hopefully I can now figure out the correct way to install libc++ >.<<br />Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0tag:blogger.com,1999:blog-3945075237501190911.post-7935565405903716292012-03-11T12:52:00.001-07:002012-03-11T23:03:15.424-07:00Matrix-Vector MultiplicationI'm finishing up my spring break this week (I know - not much of a "spring" break, but you take what you can get) and I decided to start going through the MIT OpenCourseWare course on Linear Algebra. I want to be a graphics guy when I get out of school and I haven't taken a single linear algebra course yet! Shame on me.<br />
<br />
Gilbert Strang is a great lecturer. He teaches linear algebra in such a way that even the small points he makes make you go "aha!" I'd never had matrix-vector multiplication explained to me; it was just something I memorized and had to take a leap of faith for. In his lectures, Strang really hits home the importance of this idea of "linear combinations."<br />
<br />
Linear combinations are simple. If you have a bunch of vectors and a scalar coefficient for each vector, the linear combination is just the addition of each vector multiplied by its scalar.<br />
<br />
As it turns out, that's exactly what matrix-vector multiplication is, too. Each column in your matrix is one of the vectors and each component in your vector is one of the coefficients (the first component of the vector is the coefficient for the first column, the second component of the vector is the coefficient for the second column, etc.).<br />
<br />
The typical way to think of matrix-vector multiplication is to just consider your vector as a matrix and do the standard "take a row from the first matrix, take a column from the second matrix, perform the dot product" matrix multiplication.<br />
<br />
With the linear combination concept, matrix-vector multiplication becomes much more intuitive (for me at least) because it's just a normal linear combination:<br />
<br />
<center><a href="http://www.codecogs.com/eqnedit.php?latex=\begin{bmatrix}%20c1_{1}%20&%20c2_{1}%20&%20c3_{1}%20\\%20c1_{2}%20&%20c2_{2}%20&%20c3_{2}%20\\%20c1_{3}%20&%20c2_{3}%20&%20c3_{3}%20\end{bmatrix}\begin{bmatrix}%20x_{1}\\%20x_{2}\\%20x_{3}%20\end{bmatrix}%20=%20x_{1}\begin{bmatrix}%20c1_{1}\\%20c1_{2}\\%20c1_{3}%20\end{bmatrix}%20@plus;%20x_{2}\begin{bmatrix}%20c2_{1}\\%20c2_{2}\\%20c2_{3}%20\end{bmatrix}%20@plus;%20x_{3}\begin{bmatrix}%20c3_{1}\\%20c3_{2}\\%20c3_{3}%20\end{bmatrix}" target="_blank"><img src="http://latex.codecogs.com/gif.latex?\begin{bmatrix} c1_{1} & c2_{1} & c3_{1} \\ c1_{2} & c2_{2} & c3_{2} \\ c1_{3} & c2_{3} & c3_{3} \end{bmatrix}\begin{bmatrix} x_{1}\\ x_{2}\\ x_{3} \end{bmatrix} = x_{1}\begin{bmatrix} c1_{1}\\ c1_{2}\\ c1_{3} \end{bmatrix} + x_{2}\begin{bmatrix} c2_{1}\\ c2_{2}\\ c2_{3} \end{bmatrix} + x_{3}\begin{bmatrix} c3_{1}\\ c3_{2}\\ c3_{3} \end{bmatrix}" title="\begin{bmatrix} c1_{1} & c2_{1} & c3_{1} \\ c1_{2} & c2_{2} & c3_{2} \\ c1_{3} & c2_{3} & c3_{3} \end{bmatrix}\begin{bmatrix} x_{1}\\ x_{2}\\ x_{3} \end{bmatrix} = x_{1}\begin{bmatrix} c1_{1}\\ c1_{2}\\ c1_{3} \end{bmatrix} + x_{2}\begin{bmatrix} c2_{1}\\ c2_{2}\\ c2_{3} \end{bmatrix} + x_{3}\begin{bmatrix} c3_{1}\\ c3_{2}\\ c3_{3} \end{bmatrix}" /></a></center>If you work this out, you'll see this is equivalent to the dot-product technique.<br />
<br />
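A quick sketch (mine, not from the lecture) showing that the row/dot-product view and the column/linear-combination view compute the same thing:

```python
def matvec_dot(A, x):
    """Row view: each output component is the dot product of a row with x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_columns(A, x):
    """Column view: the output is a linear combination of A's columns,
    weighted by the components of x."""
    n = len(A)
    result = [0.0] * n
    for j, coeff in enumerate(x):      # scale the j-th column by x[j]...
        for i in range(n):
            result[i] += coeff * A[i][j]  # ...and accumulate it into the result
    return result

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 10]]
x = [1, 0, 2]

assert matvec_dot(A, x) == matvec_columns(A, x)  # same answer, two viewpoints
```

Note how in the column view, a zero component of x simply drops that column out of the combination entirely — a fact the dot-product view obscures.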
The linear combination repurposing of matrix-vector multiplication makes a lot of sense if you imagine the matrix to be a rotation matrix where each column is an axis that forms a basis/coordinate system, and the vector to be a point in space. If you visualize vector addition using the "head-to-tail" concept, you can almost visualize why multiplication of a rotation matrix and a point works!Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com2tag:blogger.com,1999:blog-3945075237501190911.post-33055817401063846702012-03-02T16:32:00.002-08:002012-03-11T23:05:07.653-07:00Computability and the Underpinnings of Computer ScienceThis week I concluded a <a href="http://www.cs.rit.edu/usr/local/pub/jeh/courses/380/" target="_blank">course</a> I took this quarter that covered the theory behind computer science. How comprehensive the course's coverage of a whole field's underpinnings was, I'm not qualified to speculate (but my guess would be: not very). With that said, it did cover a lot of interesting material. The course covered:<br />
<br />
<ul>
<li>The Chomsky hierarchy of formal languages.</li>
<ul>
<li>Regular languages/(non-deterministic and deterministic) finite state automata/regular expressions/left-linear and right-linear context-free grammars.</li>
<ul>
<li>Pumping lemma for proving non-regularity of languages</li>
</ul>
<li>Deterministic push-down automata (and whatever class of languages these are equivalent to)</li>
<li>Context-free grammars/push-down automata</li>
<ul>
<li>Ambiguity</li>
<li>Greibach Normal Form</li>
<li>Chomsky Normal Form and membership test Cocke-Younger-Kasami (CYK) algorithm.</li>
<li>Pumping lemma for proving language to be not context-free.</li>
</ul>
<li>An offhand mention of context-sensitive languages</li>
<li>Recursively enumerable languages/Turing Machines.</li>
</ul>
<li>Computability (unfortunately we were not able to settle whether P=NP and win that new car).</li>
<ul>
<li>P time</li>
<li>NP time</li>
<li>NP-hard vs. NP-complete vs. NP</li>
<li>Decidability vs. intractability</li>
</ul>
</ul>
Overall I enjoyed the class a lot. I felt the course was not rigorous in its mathematical treatment of a lot of topics, but the professor did provide an alternative practical viewpoint which I appreciated (he was a compilers guy, so he found grammars to be particularly interesting and snuck in a few questions about LL parsers).<br />
<div>
<br /></div>
<div>
What I was disappointed about was the treatment of computability and exactly what the heck it means to be P or NP. I came away feeling only a little less confused about the problem than I was when I started the class. </div>
<div>
<br /></div>
<div>
As always, I went to google. I found a great explanation on <a href="http://stackoverflow.com/questions/210829/what-is-an-np-complete-problem" target="_blank">stackoverflow</a> (scroll to grom's answer) that was a good balance between layman's terms and formal terms I learned in class. </div>
<div>
<br /></div>
<div>
From my understanding now:</div>
<div>
<ul>
<li><b>P problems </b>are a class of problems where there is a known algorithm that can provide an answer/solution to the problem in polynomial time. Polynomial time can be O(1), O(n), or O(n^100000000). It doesn't matter.</li>
<li><b>NP problems</b> are a class of problems where a proposed solution (a "certificate") can be <i>verified</i> in polynomial time, even though no polynomial-time algorithm for <i>finding</i> a solution is known. Equivalently, they can be solved in "non-deterministic polynomial time": a machine that could guess the right certificate would finish in a polynomial number of steps, but a real machine has to resort to an exhaustive, "brute-force" search. This is a problem because such searches can be intractable, which means that even though it's possible to find an answer, it may take so long it's not even worth it. (Note that every P problem is also in NP.)</li>
<li><b>NP-complete problems</b> are problems that are NP but have the special property that all other NP problems can be reduced to them in polynomial time. This "simply" means any NP problem can be transformed into the NP-complete problem using a polynomial time algorithm. The upshot: a polynomial-time algorithm for any one NP-complete problem would yield a polynomial-time algorithm for every problem in NP.</li>
<li><b>NP-hard problems</b> are problems that are at least as difficult to solve as the hardest NP problem(s). Note, this class of problems includes problems that are harder than any NP problem (which means they may not even be solvable in non-deterministic polynomial time).</li>
</ul>
So the big question is: does P=NP, meaning can every NP problem actually be solved in polynomial time? If this is true, a lot of hard problems we haven't been able to reasonably compute would become solvable in an amount of time that's worthwhile. An example would be breaking cryptography and security codes, which could be a very good or a very bad thing (depending on who you are). And so I guess that's why the <a href="http://www.claymath.org/millennium/" target="_blank">Clay Mathematics Institute</a> is willing to give any person who settles whether P=NP $1 million for their efforts.</div>Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0
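The "easy to verify, hard to find" asymmetry described above can be sketched with subset-sum, a classic NP-complete problem (my example, not from the course or the stackoverflow answer): checking a proposed certificate takes polynomial time, while the only obvious way to find one is brute force over all 2^n subsets.

```python
from itertools import combinations

def verify(numbers, target, certificate):
    """Polynomial-time check: is the certificate a valid subset summing to target?"""
    return (all(0 <= i < len(numbers) for i in certificate)
            and len(set(certificate)) == len(certificate)
            and sum(numbers[i] for i in certificate) == target)

def solve_brute_force(numbers, target):
    """Exponential search: try every subset (2^n of them) until one verifies."""
    for r in range(len(numbers) + 1):
        for subset in combinations(range(len(numbers)), r):
            if verify(numbers, target, subset):
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = solve_brute_force(nums, 9)   # finding a certificate is exponential...
assert cert is not None
assert verify(nums, 9, cert)        # ...but checking one is cheap
```

If P=NP, something as fast as verify would also exist for the finding step — that is the whole question.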
So the big question is: does P=NP, meaning are all P problems equivalent to NP problems? If this is true, that means a lot of hard problems we haven't been able to reasonably compute would be solvable in an amount of time that's worthwhile. An example of this type of problem would be breaking cryptography and security codes which could be a severely good thing or bad thing (depending who you are). And so I guess that's why the <a href="http://www.claymath.org/millennium/" target="_blank">Clay Mathematics Institute</a> is willing to give any person who solves whether P=NP $1 million for their efforts.</div>Anonymoushttp://www.blogger.com/profile/18266001223305579057noreply@blogger.com0