
Extending this question: how does one obtain the gradient of the $\ell_1$-penalized loss \begin{align} L(W_1, W_2, W_3) := \sum_{i=1}^N \| W_3 \, g\left(W_2 \, f\left(W_1 x_i \right) \right) - y_i \|_2^2 + \lambda \left( \| W_3\|_1 + \| W_2\|_1 + \| W_1\|_1\right) \end{align} with respect to $W_1$, $W_2$, and $W_3$?

Here $x_i \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$, $W_3 \in \mathbb{R}^{q \times p}$, $y_i \in \mathbb{R}^q$, and $f(z) = g(z) = \frac{1}{1 + \exp(-z)}$ is the logistic sigmoid, applied elementwise.
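For concreteness, the loss can be written down in a few lines of NumPy. This is a minimal sketch under the definitions above (variable names and the columnwise data layout are my own choices, and the matrix $\ell_1$ norm is read as the nuclear norm, per the comment thread below):

```python
import numpy as np

def sigmoid(z):
    # f(z) = g(z) = 1 / (1 + exp(-z)), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def nuclear_norm(W):
    # sum of singular values (Schatten-1 norm)
    return np.linalg.svd(W, compute_uv=False).sum()

def loss(W1, W2, W3, X, Y, lam):
    # X is n x N with columns x_i; Y is q x N with columns y_i
    H1 = sigmoid(W1 @ X)        # f(W1 x_i), m x N
    H2 = sigmoid(W2 @ H1)       # g(W2 f(W1 x_i)), p x N
    R = W3 @ H2 - Y             # residuals, q x N
    data_term = np.sum(R ** 2)  # sum_i ||.||_2^2 over all samples
    reg = nuclear_norm(W1) + nuclear_norm(W2) + nuclear_norm(W3)
    return data_term + lam * reg
```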


EDIT:

The gradient of the squared $\ell_2$ data term of the cost function is given in the linked answer. But how should the $\ell_1$ regularization be handled so that one can find the optimal weights?


Thank you so much in advance for your help.

learning
  • The notation $\|W\|_1$ is ambiguous. Does it denote the Schatten/nuclear norm or the Hölder/Manhattan norm? (Interestingly, the Hölder and Schatten norms coincide for $\|W\|_2$, so there's no ambiguity there.) – lynn Mar 25 '19 at 17:31
  • Sorry for the ambiguity. We can assume the nuclear norm (or any norm that promotes sparsity and is relatively easy to compute). – learning Mar 25 '19 at 18:11

1 Answer


Let $F=F(W_1,W_2,W_3)$ denote the function from your linked answer. Then the new function is simply $$L = F + \lambda\,\Big(\|W_1\|_1 + \|W_2\|_1 + \|W_3\|_1\Big)$$ Consider what happens when you vary $W_1$ while holding $(W_2,W_3)$ constant. Writing the nuclear norm as $\|W_1\|_1 = \operatorname{tr}\!\big((W_1^TW_1)^{1/2}\big)$, its differential is $W_1(W_1^TW_1)^{-1/2}:dW_1$, so $$\eqalign{ dL &= dF + \lambda\,\Big(d\|W_1\|_1 +0+0\Big) \cr &= \bigg(\frac{\partial F}{\partial W_1} + \lambda\,W_1(W_1^TW_1)^{-1/2}\bigg):dW_1 \cr \frac{\partial L}{\partial W_1} &= \frac{\partial F}{\partial W_1} + \lambda\,W_1(W_1^TW_1)^{-1/2} \cr }$$ where the gradient $\frac{\partial F}{\partial W_1}$ is known from the linked answer.
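As a quick sanity check on that matrix-calculus step, here is a small numerical sketch in NumPy (my addition, not from the answer): it compares $W(W^TW)^{-1/2}$ against a central finite difference of the nuclear norm, assuming $W$ has full column rank so that $(W^TW)^{-1/2}$ exists.

```python
import numpy as np

def nuclear_norm(W):
    return np.linalg.svd(W, compute_uv=False).sum()

def nuclear_norm_grad(W):
    # W (W^T W)^{-1/2}; the inverse square root is formed
    # via an eigendecomposition of the symmetric matrix W^T W
    vals, vecs = np.linalg.eigh(W.T @ W)
    return W @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))         # full column rank almost surely
dW = rng.standard_normal(W.shape)       # random perturbation direction
eps = 1e-6

fd = (nuclear_norm(W + eps * dW) - nuclear_norm(W - eps * dW)) / (2 * eps)
an = np.sum(nuclear_norm_grad(W) * dW)  # <grad, dW> Frobenius inner product
print(np.isclose(fd, an, atol=1e-6))    # expect True
```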

To calculate the other two gradients, simply repeat this process.
First, by holding $(W_1,W_3)$ constant and varying $W_2$.
Then, by holding $(W_1,W_2)$ constant and varying $W_3$.
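Putting the pieces together, the sketch below computes all three gradients of $L$: standard backpropagation for the data term $F$, plus $\lambda\,W_k(W_k^TW_k)^{-1/2}$ for each penalty. This is my own consolidation; whether the backprop expressions match the linked answer's notation exactly is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nuclear_norm_grad(W):
    # W (W^T W)^{-1/2}, assuming W has full column rank
    vals, vecs = np.linalg.eigh(W.T @ W)
    return W @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def grad_L(W1, W2, W3, X, Y, lam):
    # forward pass (X: n x N, Y: q x N, columns are samples)
    H1 = sigmoid(W1 @ X)
    H2 = sigmoid(W2 @ H1)
    R = W3 @ H2 - Y
    # backward pass through the data term F = sum_i ||.||_2^2
    gW3 = 2.0 * R @ H2.T
    D2 = (W3.T @ (2.0 * R)) * H2 * (1.0 - H2)   # sigmoid' = s(1 - s)
    gW2 = D2 @ H1.T
    D1 = (W2.T @ D2) * H1 * (1.0 - H1)
    gW1 = D1 @ X.T
    # add the nuclear-norm gradients from the derivation above
    return (gW1 + lam * nuclear_norm_grad(W1),
            gW2 + lam * nuclear_norm_grad(W2),
            gW3 + lam * nuclear_norm_grad(W3))
```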

lynn
  • Thank you. Just a clarification: which norm have you assumed? The nuclear norm? – learning Mar 25 '19 at 19:30
  • Yes, the nuclear norm. – lynn Mar 25 '19 at 20:47
  • Sorry, some further questions: (a) If my last weight $W_3$ is a row vector, then how would I take the derivative of the $\ell_1$ norm of a vector (since it's non-differentiable)? (b) Also, if the inverse doesn't exist, then I guess I need to do Tikhonov-like regularization? – learning Mar 26 '19 at 15:59