
I'm trying to derive the ADMM updates for the $\ell_1$ penalized Huber loss:

$$ \arg\min_x \phi_h \left(y - Ax\right) + \gamma\lVert x \rVert_1 $$

where

$$ \phi_h \left( u \right) = \begin{cases} \frac{1}{2}u^2, & \text{if } \left| u \right| \leq 1 \\ \left| u \right| - \frac{1}{2}, & \text{otherwise} \end{cases} $$

So far I know I need to compute the proximal operator of both $ \phi_h $ and $ \lVert \cdot \rVert_1 $, and that the steps are:

$$ x^{k+1} = \arg \min_x \left(\phi_h\left(y-Ax\right) + \frac{\rho}{2}\lVert y - Ax -z^{k} + u^{k} \rVert_2^2 \right) $$

$$ z^{k+1} = S_{\gamma/\rho}\left(x^{k+1} + u^{k} \right) $$

$$ u^{k+1} = u^{k} + x^{k+1} - z^{k+1}$$

where

$$ S_{\lambda}\left( y \right) = \operatorname{sign}\left( y \right) \max \left(\left| y \right| - \lambda, 0 \right) $$

This is from equation (6.1) of Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd et al.).
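
For concreteness, here is a small NumPy sketch of the soft-thresholding operator as I have written it above (my own snippet, not from the book):

```python
import numpy as np

def soft_threshold(y, lam):
    # S_lam(y): proximal operator of lam * ||.||_1, applied component-wise
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# so the z-update above would read: z = soft_threshold(x + u, gamma / rho)
```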

I'm having difficulty finding the $x^{k+1}$ step. Boyd (section 6.1.1) suggests that it will be:

$$ \frac{\rho}{1+\rho}\left(Ax - y + u^k\right) + \frac{1}{1+\rho}S_{1+1/\rho}\left( Ax - y + u^k \right) $$

But the answers to Proximal Operator of the Huber Function suggest that the $j^{th}$ component of the proximal operator will be:

$$ v_j = \frac{y_j-a_j x_j}{\max\left(\left| y_j-a_j x_j \right|, 2 \right)} $$

Any help finding this would be hugely appreciated.

Tom Kealy
  • @dohmatob might find this question interesting – Tom Kealy May 22 '18 at 09:50
  • Introduce two variable splittings: $z = y - Ax$ and $w = x$, to obtain the augmented Lagrangian $\mathcal L(x, z, w, u, v) = \phi_h(z) + \gamma\lVert w\rVert_1 + \langle u, y - Ax - z\rangle + \frac{1}{2}\rho\lVert y-Ax-z\rVert_2^2 + \langle v, x - w\rangle + \frac{1}{2}\rho \lVert x-w\rVert_2^2$. Now (cyclically) fix 3 of the variables and optimize w.r.t the 4th (see the sketch after these comments). – dohmatob May 22 '18 at 11:23
  • ... All the proximal operators should now be simple to compute. – dohmatob May 22 '18 at 11:37
  • BTW, your problem looks like a natural candidate for primal-dual algorithms: https://hal.archives-ouvertes.fr/hal-00490826/document – dohmatob May 22 '18 at 11:38
  • I'd be tempted to use FISTA to solve this problem, since the objective function has the form $f+g$, where $f$ is differentiable and $g$ is "simple" (meaning that $g$ has an easy prox-operator). – littleO Mar 16 '20 at 09:48
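
A minimal NumPy sketch of the double splitting dohmatob describes above (scaled dual form; the choices of $\rho$, the iteration count, and the use of the closed-form Huber prox with $\delta = 1$ are assumptions of this sketch, not taken from the comments):

```python
import numpy as np

def huber_prox(t, lam):
    # prox of lam * phi_h (delta = 1), component-wise:
    # t_i - lam * t_i / max(|t_i|, lam + 1)
    return t - lam * t / np.maximum(np.abs(t), lam + 1.0)

def soft_threshold(t, lam):
    # prox of lam * ||.||_1 (soft thresholding)
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def admm_huber_l1(A, y, gamma, rho=1.0, n_iter=500):
    """Minimize phi_h(y - A x) + gamma * ||x||_1 via the splittings
    z = y - A x and w = x, with scaled dual variables u and v."""
    m, n = A.shape
    x, w, v = np.zeros(n), np.zeros(n), np.zeros(n)
    z, u = np.zeros(m), np.zeros(m)
    Q = A.T @ A + np.eye(n)  # x-update is a fixed regularized least-squares solve
    for _ in range(n_iter):
        # x-update: minimize rho/2 ||y - Ax - z + u||^2 + rho/2 ||x - w + v||^2
        x = np.linalg.solve(Q, A.T @ (y - z + u) + (w - v))
        r = y - A @ x
        # z-update: prox of (1/rho) * phi_h at (y - Ax + u)
        z = huber_prox(r + u, 1.0 / rho)
        # w-update: prox of (gamma/rho) * ||.||_1 at (x + v)
        w = soft_threshold(x + v, gamma / rho)
        # scaled dual updates
        u += r - z
        v += x - w
    return x
```

With this splitting the $x$-update is the same regularized least-squares problem at every iteration, so the matrix $A^T A + I$ can be factored once outside the loop.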

1 Answer


The Huber Loss is defined as:

$$ L_\delta \left( x \right) = \begin{cases} \frac{1}{2} {x}^{2} & \text{for} \; \left| x \right| \leq \delta \\ \delta (\left| x \right| - \frac{1}{2} \delta) & \text{for} \; \left| x \right| > \delta \end{cases} $$

When the input is a vector, the Huber Loss is applied component-wise and the results are summed.

Regarding your question about the difference between the two derivations of the Proximal Operator for the Huber Loss:
I implemented both derivations of the Proximal Operator for the Huber Loss $ {L}_{1} \left( \cdot \right) $ with $ \delta = 1 $ (to match your definition):

  1. $ {\left( \operatorname{prox}_{ \lambda {L}_{1} \left( \cdot \right) } \left( y \right) \right)}_{i} = {y}_{i} - \frac{\lambda {y}_{i}}{\max \left( \left| {y}_{i} \right|, \lambda + 1 \right)} $ from Proximal Operator of the Huber Loss Function.
  2. $ \operatorname{prox}_{ \lambda {L}_{1} \left( \cdot \right) } \left( y \right) = \frac{1}{1 + \lambda} y + \frac{\lambda}{1 + \lambda} \mathcal{S}_{1 + \lambda} \left( y \right) $ where $ \mathcal{S}_{\lambda} \left( \cdot \right) $ is the Soft Threshold Operator (the Proximal Operator of the $ {L}_{1} $ Norm). This was taken from Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, from the section called Huber Fitting. The book uses the $ \rho = \frac{1}{\lambda} $ notation for the Proximal Operator, hence I adapted it accordingly.

In my code I found both to be equivalent and accurate when compared against CVX. The code is available at my StackExchange Mathematics Q2791227 GitHub Repository. The code is extended to support any value of $ \delta $, as in my solution to Proximal Operator / Proximal Mapping of the Huber Loss Function.
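
For reference, a standalone NumPy sketch (separate from the linked repository code) checking numerically that the two forms above agree for $ \delta = 1 $:

```python
import numpy as np

def prox_huber_1(y, lam):
    # Form 1: y_i - lam * y_i / max(|y_i|, lam + 1)
    return y - lam * y / np.maximum(np.abs(y), lam + 1.0)

def prox_huber_2(y, lam):
    # Form 2: y / (1 + lam) + (lam / (1 + lam)) * S_{1 + lam}(y),
    # where S is the soft-threshold operator
    s = np.sign(y) * np.maximum(np.abs(y) - (1.0 + lam), 0.0)
    return y / (1.0 + lam) + (lam / (1.0 + lam)) * s

y = np.linspace(-5.0, 5.0, 1001)
for lam in (0.1, 1.0, 3.0):
    assert np.allclose(prox_huber_1(y, lam), prox_huber_2(y, lam))
print("Both forms agree.")
```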

Pay attention that the book Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers uses the Huber Loss Function for Robust Regression, while you're using it for Regularized Robust Regression, so you will probably need to adapt $ \lambda $ in your steps accordingly.

Royi