In Keras, there are two common methods to reduce overfitting: L1/L2 regularization and dropout layers.
In what situations should I use L1/L2 regularization instead of a dropout layer, and in what situations is a dropout layer the better choice?
I am unsure there is a formal way to show which is best in which situation - simply trying out different combinations is probably your best bet!
It is worth noting that dropout actually does a little more than just provide a form of regularisation: it adds robustness to the network by letting it train many different sub-networks. Because the randomly deactivated neurons are essentially removed for that forward/backward pass, each pass has the same effect as if you had used a slightly different network! Have a look at this post for a few more pointers on the benefits of dropout layers.
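To make that concrete, here is a minimal Keras sketch (the layer sizes and the 0.5 rate are arbitrary placeholders, not recommendations) showing that, while dropout is active, the same input effectively passes through a different "thinned" network on every call:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy model with a 50% dropout layer.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(1, 20).astype("float32")

# With training=True, dropout stays active: each call zeroes a different
# random half of the hidden activations, so the same input gives different
# outputs -- effectively a different thinned network on every pass.
print(model(x, training=True).numpy())
print(model(x, training=True).numpy())

# With training=False (the default at inference), dropout is a no-op and
# the output is deterministic.
print(model(x, training=False).numpy())
```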
$L_1$ versus $L_2$ is easier to explain: the $L_2$ penalty grows quadratically, so it punishes outlying (large) values much more heavily than $L_1$ does. Have a look here for more detailed comparisons.
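For reference, these are the standard penalty terms added to the loss, writing $w_i$ for the weights and $\lambda$ for the regularization strength:

$$L_1:\; \lambda \sum_i |w_i| \qquad\qquad L_2:\; \lambda \sum_i w_i^2$$

Because the $L_2$ term is quadratic, a single large $w_i$ contributes far more to the $L_2$ penalty than it would to the $L_1$ penalty.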
It seems deciding between L2 and Dropout is a "guess and check" type of thing, unfortunately. Both are used to make the network more "robust" and reduce overfitting by preventing the network from relying too heavily on any given neuron, i.e. it is generally believed to be better to have many neurons contributing to a model's output rather than a select few. L2 and Dropout achieve this by different means, so you really have to experiment to see which gives you better results.
Dropout randomly mutes a percentage of neurons (a rate you choose) on each training forward pass through the network, forcing the network to diversify.
L2 shrinks large weights, reducing the contribution of outlier neurons (those with weights significantly larger than the rest) and preventing any one neuron's weights from exploding. This also forces the network to diversify.
L1 should really be in its own category, as it is most useful for feature selection and small networks. It almost does the opposite of L2 and Dropout by simplifying the network and muting some neurons (driving their weights to exactly zero).
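Putting the three together, a minimal Keras sketch might look like the following (the layer sizes and regularization strengths are placeholder values you would tune, not recommendations):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Small classifier illustrating the three options discussed above.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    # L2 shrinks all weights towards zero, discouraging any single weight
    # from dominating (1e-4 is just a starting strength to tune).
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    # Dropout randomly mutes 30% of these activations on each training pass.
    layers.Dropout(0.3),
    # L1 pushes many weights to exactly zero, which is why it is often used
    # for feature selection and sparser, smaller networks.
    layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-5)),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```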
If you notice that adding a small amount of regularization decreases your validation accuracy / increases your loss, it's probably because your network wasn't overfitting in the first place.