A Machine Learning Quicky — Why Use L2 Regularization

A really short discussion of why L2 regularization is such a good choice


I’ve seen it said many times in many places that you should always use L2 regularization for your neural networks. Not always always, but pretty frequently. I accepted it as gospel, since it makes sense as a regularization scheme, but I could never find a reason why that was so.

Here’s why that’s so.

Think about any machine learning model (a deep neural network, logistic regression, linear regression): when it predicts anything it’s saying “Given $x$, I’m pretty sure the answer is $y$,” where $y = f(x)$ and $f$ is whatever your model is. But $f$ isn’t just a function of $x$, it’s also a function of your trained weights $\mathbf{w}$, so really $y = f(x; \mathbf{w})$.

Now, imagine $f$ only weakly depends on some subset of weights $\tilde{\mathbf{w}}$. When we minimize the loss using, say, a standard squared-error (L2) loss function, we’re minimizing

$$\textrm{Loss} = \sum_j \left ( y_j - f(x_j; \mathbf{w}) \right )^2$$

with respect to the weights $\mathbf{w}$. Now, if $f$ doesn’t depend at all on some weight $w^\star$, then the derivative of the loss with respect to $w^\star$ is zero: whatever we initialized $w^\star$ as is what we’re stuck with.

Now imagine the dependence is very weak, so the derivative of $f$ with respect to $w^\star$ isn’t zero, but it’s much smaller than the derivatives with respect to the other weights. The loss-minimizing value of $w^\star$ is going to be quite small, and though we want it to head toward zero it will get there only very slowly, since each gradient descent step moves $w^\star$ in proportion to that tiny derivative.
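
Here’s a tiny numerical sketch of that slow crawl (none of this is canonical; the $0.001$ coupling, the learning rate, the data, and the variable names are all made up for illustration). The output depends on $w_2$ only through a factor of $0.001$, so its gradient is tiny and plain gradient descent barely moves it:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 * x1 + 0.1 * rng.normal(size=100)   # y has no real dependence on x2

w1, w2 = 0.0, 5.0        # w2 starts far from the ~0 value it "should" have
lr = 1e-3

for step in range(1000):
    pred = w1 * x1 + 0.001 * w2 * x2        # f depends only weakly on w2
    err = pred - y
    grad_w1 = 2 * np.sum(err * x1)          # d(Loss)/d(w1)
    grad_w2 = 2 * np.sum(err * 0.001 * x2)  # d(Loss)/d(w2): tiny, since df/dw2 = 0.001 * x2
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

print(round(w1, 2), round(w2, 2))  # w1 lands near 2; w2 has barely budged from 5
```

Even though the data contain no real relationship through $x_2$, after a thousand steps $w_2$ is still sitting essentially wherever we initialized it.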

Now what happens if we add an L2 regularization term for $w^\star$? The loss function looks like this:

$$\textrm{Loss} = \sum_j \left ( y_j - f(x_j; \mathbf{w}) \right )^2 + \lambda \left( w^\star \right)^2$$

If there is absolutely no dependence of the output on the weight, then the additional L2 regularization term will take $w^\star$ to zero as quickly as gradient descent can find the minimum of $(w^\star)^2$ from the initialized value. If the dependence is weak, the training will push the weight as small as it can get without meaningfully increasing the prediction error.
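
Continuing the same toy sketch (reusing `x1`, `x2`, `y`, and `lr` from the snippet above; $\lambda = 10$ is just an arbitrary choice so the effect is visible within 1,000 steps), adding the $\lambda (w^\star)^2$ term gives $w_2$ a gradient of its own:

```python
lam = 10.0               # arbitrary regularization strength for the demo
w1, w2 = 0.0, 5.0        # same initialization as before

for step in range(1000):
    pred = w1 * x1 + 0.001 * w2 * x2
    err = pred - y
    grad_w1 = 2 * np.sum(err * x1)
    grad_w2 = 2 * np.sum(err * 0.001 * x2) + 2 * lam * w2   # extra term from lam * w2**2
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

print(round(w1, 2), round(w2, 2))  # w1 still lands near 2, but w2 is now driven essentially to zero
```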

Essentially, L2 regularization makes the training want to default to a model with no weights whatsoever: the need to predict accurately competes with the L2 regularization term, which drags the weights as close to zero as possible and tries to make $f$ a sort of null model. In this way, if there’s no relationship whatsoever between $x$ and $y$, $f$ will quickly become a null model instead of having the training chase any random correlations in the noise.

So why not use L1 regularization? Well, what’s the derivative of $|w^\star|$? It’s $-1$ for $w^\star < 0$, $+1$ for $w^\star > 0$, and (by convention) $0$ exactly at $w^\star = 0$, so the gradient of the penalty $\lambda |w^\star|$ has magnitude $\lambda$ no matter how close to zero we get. So what happens when we get close to the origin? If $|w^\star| = \epsilon$ is smaller than the gradient descent step $\eta \lambda$ (where $\eta$ is the learning rate), then $w^\star$ will just oscillate between $\epsilon$ and $\epsilon - \eta \lambda$ (or vice versa, depending on the sign of $\epsilon$). Try it! We never converge $w^\star$ to the zero value it so desperately wants.
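
You can check the oscillation with an even smaller sketch: gradient descent on the penalty $\lambda |w^\star|$ alone (no data term at all), with the learning rate written out explicitly, since $\eta \lambda$ is what sets the size of the hops. The specific numbers here are arbitrary:

```python
import numpy as np

lam, lr = 1.0, 0.1       # arbitrary values; all that matters is the product lr * lam
w = 0.03                 # already closer to zero than the step size lr * lam = 0.1

for step in range(6):
    print(round(w, 2))
    w -= lr * lam * np.sign(w)   # the step is always +/- lr*lam, never smaller
# prints 0.03, -0.07, 0.03, -0.07, ... and never settles at zero
```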

And that’s why you should use L2 regularization — it makes the default model completely null regardless of the initialized weights, and fights with the prediction errors to make the model as null as possible.
