
I'm trying to derive the ADMM updates for the $\ell_1$ penalized Huber loss:

$$ \arg\min_x \phi_h \left(y - Ax\right) + \gamma\lVert x \rVert_1 $$

where

$$ \phi_h \left( u \right) = \begin{cases} \frac{1}{2}u^2, & \text{if } \left| u \right| \leq 1 \\ \left| u \right| - \frac{1}{2}, & \text{otherwise} \end{cases} $$

So far I know I need to compute the proximal operator of both $ \phi_h $ and $ \lVert \cdot \rVert_1 $, and that the steps are:

$$ x^{k+1} = \arg \min_x \left(\phi_h\left(y-Ax\right) + \frac{\rho}{2}\lVert y - Ax -z^{k} + u^{k} \rVert_2^2 \right) $$

$$ z^{k+1} = S_{\gamma/\rho}\left(x^{k+1} + u^{k} \right) $$

$$ u^{k+1} = u^{k} + x^{k+1} - z^{k+1}$$

where

$$ S_{\lambda}\left( y \right) = \operatorname{sign}\left( y \right) \max \left(\left| y \right| - \lambda, 0 \right) $$

This is from equation (6.1) of Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Boyd et al.).
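
For concreteness, here is a small NumPy sketch of the soft-thresholding operator as I have written it above (my own snippet, not from the book):

```python
import numpy as np

def soft_threshold(y, lam):
    # S_lam(y): proximal operator of lam * ||.||_1, applied component-wise
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# so the z-update above would read: z = soft_threshold(x + u, gamma / rho)
```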

I'm having difficulty finding the $x^{k+1}$ step. Boyd (section 6.1.1) suggests that it will be:

$$ \frac{\rho}{1+\rho}\left(Ax - y + u^k\right) + \frac{1}{1+\rho}S_{1+1/\rho}\left( Ax - y + u^k \right) $$

But the answers to Proximal Operator of the Huber Function suggest that the $j^{th}$ component of the proximal operator will be:

$$ v_j = \frac{y_j-a_j x_j}{\max\left(\left| y_j-a_j x_j \right|, 2 \right)} $$

Any help finding this would be hugely appreciated.

Tom Kealy
  • @dohmatob might find this question interesting – Tom Kealy May 22 '18 at 09:50
  • Introduce two variable splittings: $z = y - Ax$ and $w = x$, to obtain the augmented Lagrangian $\mathcal L(x, z, w, u, v) = \phi_h(z) + \gamma\lVert w\rVert_1 + \langle u, y - Ax - z\rangle + \frac{1}{2}\rho\lVert y-Ax-z\rVert_2^2 + \langle v, x - w\rangle + \frac{1}{2}\rho \lVert x-w\rVert_2^2$. Now (cyclically) fix 3 of the variables and optimize w.r.t the 4th (see the sketch after these comments). – dohmatob May 22 '18 at 11:23
  • ... All the proximal operators should now be simple to compute. – dohmatob May 22 '18 at 11:37
  • BTW, your problem looks like a natural candidate for primal-dual algorithms: https://hal.archives-ouvertes.fr/hal-00490826/document – dohmatob May 22 '18 at 11:38
  • I'd be tempted to use FISTA to solve this problem, since the objective function has the form $f+g$, where $f$ is differentiable and $g$ is "simple" (meaning that $g$ has an easy prox-operator). – littleO Mar 16 '20 at 09:48
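
A minimal NumPy sketch of the double splitting dohmatob describes above (scaled dual form; the choices of $\rho$, the iteration count, and the use of the closed-form Huber prox with $\delta = 1$ are assumptions of this sketch, not taken from the comments):

```python
import numpy as np

def huber_prox(t, lam):
    # prox of lam * phi_h (delta = 1), component-wise:
    # t_i - lam * t_i / max(|t_i|, lam + 1)
    return t - lam * t / np.maximum(np.abs(t), lam + 1.0)

def soft_threshold(t, lam):
    # prox of lam * ||.||_1 (soft thresholding)
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def admm_huber_l1(A, y, gamma, rho=1.0, n_iter=500):
    """Minimize phi_h(y - A x) + gamma * ||x||_1 via the splittings
    z = y - A x and w = x, with scaled dual variables u and v."""
    m, n = A.shape
    x, w, v = np.zeros(n), np.zeros(n), np.zeros(n)
    z, u = np.zeros(m), np.zeros(m)
    Q = A.T @ A + np.eye(n)  # x-update is a fixed regularized least-squares solve
    for _ in range(n_iter):
        # x-update: minimize rho/2 ||y - Ax - z + u||^2 + rho/2 ||x - w + v||^2
        x = np.linalg.solve(Q, A.T @ (y - z + u) + (w - v))
        r = y - A @ x
        # z-update: prox of (1/rho) * phi_h at (y - Ax + u)
        z = huber_prox(r + u, 1.0 / rho)
        # w-update: prox of (gamma/rho) * ||.||_1 at (x + v)
        w = soft_threshold(x + v, gamma / rho)
        # scaled dual updates
        u += r - z
        v += x - w
    return x
```

With this splitting the $x$-update is the same regularized least-squares problem at every iteration, so the matrix $A^T A + I$ can be factored once outside the loop.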

1 Answer


The Huber Loss is defined as:

$$ L_\delta \left( x \right) = \begin{cases} \frac{1}{2} {x}^{2} & \text{for} \; \left| x \right| \leq \delta \\ \delta (\left| x \right| - \frac{1}{2} \delta) & \text{for} \; \left| x \right| > \delta \end{cases} $$

When the input is a vector, the Huber Loss is applied component-wise and the results are summed.

Regarding your question about the difference between the two derivations of the Proximal Operator for the Huber Loss:
I implemented both derivations of the Proximal Operator for the Huber Loss $ {L}_{1} \left( \cdot \right) $ with $ \delta = 1 $ (to match your definition):

  1. $ {\left( \operatorname{prox}_{ \lambda {L}_{1} \left( \cdot \right) } \left( y \right) \right)}_{i} = {y}_{i} - \frac{\lambda {y}_{i}}{\max \left( \left| {y}_{i} \right|, \lambda + 1 \right)} $ from Proximal Operator of the Huber Loss Function.
  2. $ \operatorname{prox}_{ \lambda {L}_{1} \left( \cdot \right) } \left( y \right) = \frac{1}{1 + \lambda} y + \frac{\lambda}{1 + \lambda} \mathcal{S}_{1 + \lambda} \left( y \right) $ where $ \mathcal{S}_{\lambda} \left( \cdot \right) $ is the Soft Threshold Operator (the Proximal Operator of the $ {L}_{1} $ Norm). This was taken from Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, from the section called Huber Fitting. The book uses the $ \rho = \frac{1}{\lambda} $ notation for the Proximal Operator, hence I adapted it accordingly.

In my code I found both to be equivalent and accurate when compared against CVX. The code is available at my StackExchange Mathematics Q2791227 GitHub Repository. The code is extended to support any value of $ \delta $, as in my solution to Proximal Operator / Proximal Mapping of the Huber Loss Function.
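
For reference, a standalone NumPy sketch (separate from the linked repository code) checking numerically that the two forms above agree for $ \delta = 1 $:

```python
import numpy as np

def prox_huber_1(y, lam):
    # Form 1: y_i - lam * y_i / max(|y_i|, lam + 1)
    return y - lam * y / np.maximum(np.abs(y), lam + 1.0)

def prox_huber_2(y, lam):
    # Form 2: y / (1 + lam) + (lam / (1 + lam)) * S_{1 + lam}(y),
    # where S is the soft-threshold operator
    s = np.sign(y) * np.maximum(np.abs(y) - (1.0 + lam), 0.0)
    return y / (1.0 + lam) + (lam / (1.0 + lam)) * s

y = np.linspace(-5.0, 5.0, 1001)
for lam in (0.1, 1.0, 3.0):
    assert np.allclose(prox_huber_1(y, lam), prox_huber_2(y, lam))
print("Both forms agree.")
```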

Pay attention that the book Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers uses the Huber Loss Function for Robust Regression, while you're using it for Regularized Robust Regression, so you will probably need to adapt $ \lambda $ in your steps accordingly.

Royi