
I'm considering a support vector regression model with the prediction $$ \hat{y}(\mathbf{x}_\star)=\boldsymbol{\theta}^{\top} \boldsymbol{\phi}(\mathbf{x}_\star)$$ where $\boldsymbol{\theta}$ are the coefficients to learn and $\boldsymbol{\phi}(\mathbf{x}_\star)$ is a transformation of the input $\mathbf{x}_\star$. The optimisation problem is $$ \widehat{\boldsymbol{\theta}}=\arg \min _{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^n \max \{0,|y_i-\underbrace{\boldsymbol{\theta}^{\top} \boldsymbol{\phi}\left(\mathbf{x}_i\right)}_{\hat{y}\left(\mathbf{x}_i\right)}|-\epsilon\}+\lambda\|\boldsymbol{\theta}\|_2^2 ,$$ where the error term is the $\epsilon$-insensitive (hinge-type) loss and there is $\ell_2$ regularisation. The parameter $\epsilon$ defines the $\epsilon$-tube within which no penalty is incurred.

Deriving the dual problem via Lagrange multipliers, we find that the solution is given by $$\hat{y}\left(\mathbf{x}_{\star}\right)=\hat{\boldsymbol{\alpha}}^{\top} \underbrace{\Phi(\mathbf{X}) \phi\left(\mathbf{x}_{\star}\right)}_{K\left(\mathbf{X}, \mathbf{x}_{\star}\right)}$$ where $\hat{\boldsymbol{\alpha}}$ is the solution to the optimisation problem $$\hat{\boldsymbol{\alpha}}=\arg \min _{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^{\top} \boldsymbol{K}(\mathbf{X}, \mathbf{X}) \boldsymbol{\alpha}-\boldsymbol{\alpha}^{\top} \mathbf{y}+\epsilon\|\boldsymbol{\alpha}\|_1$$ subject to $$ \left|\alpha_i\right| \leq \frac{1}{2 n \lambda}. $$ In the above, we have used the Gram matrix, which is given by $$\boldsymbol{K}(\mathbf{X}, \mathbf{X})=\begin{bmatrix} \kappa\left(\mathbf{x}_1, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_1, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_1, \mathbf{x}_n\right) \\ \kappa\left(\mathbf{x}_2, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_2, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_2, \mathbf{x}_n\right) \\ \vdots & & \ddots & \vdots \\ \kappa\left(\mathbf{x}_n, \mathbf{x}_1\right) & \kappa\left(\mathbf{x}_n, \mathbf{x}_2\right) & \ldots & \kappa\left(\mathbf{x}_n, \mathbf{x}_n\right) \end{bmatrix}$$ where the kernel is $$ \kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\boldsymbol{\phi}(\mathbf{x})^{\top} \boldsymbol{\phi}\left(\mathbf{x}^{\prime}\right).$$ In support vector classification we don't have the $\epsilon\|\boldsymbol{\alpha}\|_1$ term and the optimisation is a quadratic problem. Can the optimisation above be put into a quadratic form? Intuitively, I feel like it can't. So what numerical algorithms exist to solve problems like this?
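For reference, this is how I'm computing the Gram matrix and the kernel prediction once $\hat{\boldsymbol{\alpha}}$ is available (a minimal NumPy sketch of my own; the RBF kernel and the helper names are just illustrative choices, not part of the model):

```python
import numpy as np

# An RBF kernel, purely as an example; any valid kernel kappa(x, x') = phi(x)' phi(x') works.
def kappa(x, x_prime, gamma=1.0):
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def gram_matrix(X):
    # K[i, j] = kappa(x_i, x_j), with the rows of X being the training inputs
    n = X.shape[0]
    return np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])

def predict(alpha_hat, X, x_star):
    # y_hat(x_star) = alpha_hat' K(X, x_star)
    k_star = np.array([kappa(X[i], x_star) for i in range(X.shape[0])])
    return alpha_hat @ k_star
```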


1 Answer


The problem is formulated as:

$$ \begin{align*} \arg \min_{\boldsymbol{x}} \quad & \frac{1}{2} \boldsymbol{x}^{T} \boldsymbol{K} \boldsymbol{x} - \boldsymbol{x}^{T} \boldsymbol{y} + \varepsilon {\left\| \boldsymbol{x} \right\|}_{1} \\ \text{subject to} \quad & \begin{aligned} \left| {x}_{i} \right| & \leq \frac{1}{2 \lambda n} \end{aligned} \end{align*} $$

It can be solved both by Proximal Gradient Descent and by ADMM.

Proximal Gradient Descent

Since the constraint is basically a box constraint, it can be incorporated into the proximal operator of the iterative solver (see Lasso ADMM with Positive Constraint).

So the problem can be solved with the following atoms:

  • $ f \left( \boldsymbol{x} \right) = \frac{1}{2} \boldsymbol{x}^{T} \boldsymbol{K} \boldsymbol{x} - \boldsymbol{x}^{T} \boldsymbol{y} \Rightarrow \nabla f \left( \boldsymbol{x} \right) = \boldsymbol{K} \boldsymbol{x} - \boldsymbol{y} $.
  • $ \operatorname{Prox}_{\varepsilon g} \left( \boldsymbol{z} \right) = \max \left\{ \min \left\{ \mathcal{S}_{\varepsilon} \left( \boldsymbol{z} \right), \frac{1}{2 \lambda n} \right\}, -\frac{1}{2 \lambda n} \right\} $ where $\mathcal{S}_{\varepsilon} \left( \cdot \right)$ is the element-wise soft-threshold operator, whose output is then clipped to the box constraint.

The code implements Accelerated Proximal Gradient Descent (FISTA style).
It is able to solve the case of $n = 500$ in ~0.1 [Second].
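As a rough illustration of this scheme, here is a minimal NumPy sketch of a FISTA-style accelerated proximal gradient iteration (this is not the GitHub code linked below; the step size $1/L$ and the iteration count are plain default choices):

```python
import numpy as np

def svr_dual_fista(K, y, eps, lam, num_iter=500):
    """Minimise 0.5 a' K a - a' y + eps * ||a||_1  s.t. |a_i| <= 1 / (2 lam n),
    using accelerated proximal gradient descent (FISTA style)."""
    n = len(y)
    c = 1.0 / (2.0 * lam * n)                # radius of the box constraint
    L = np.linalg.eigvalsh(K).max()          # Lipschitz constant of the smooth gradient
    t = 1.0 / L                              # step size

    def prox(v, thresh):
        # soft threshold (l1 term) followed by projection onto the box
        v = np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
        return np.clip(v, -c, c)

    a = np.zeros(n)                          # current iterate
    z = a.copy()                             # extrapolated (momentum) point
    s = 1.0
    for _ in range(num_iter):
        a_prev = a
        grad = K @ z - y                     # gradient of 0.5 z' K z - z' y
        a = prox(z - t * grad, t * eps)      # proximal step
        s_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * s * s))
        z = a + ((s - 1.0) / s_next) * (a - a_prev)
        s = s_next
    return a
```

The only problem-specific parts are the gradient of the quadratic term and the combined soft-threshold / box-projection prox listed above.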

ADMM

Defining the proximal operator of $f$, and using the same proximal operator for $g$ as above:

  • $ f \left( \boldsymbol{x} \right) = \frac{1}{2} \boldsymbol{x}^{T} \boldsymbol{K} \boldsymbol{x} - \boldsymbol{x}^{T} \boldsymbol{y} \Rightarrow \operatorname{Prox}_{\rho f} \left( \boldsymbol{z} \right) = \arg \min_{\boldsymbol{x}} \frac{\rho}{2} {\left\| \boldsymbol{x} - \boldsymbol{z} \right\|}_{2}^{2} + f \left( \boldsymbol{x} \right) = {\left( \boldsymbol{K} + \rho \boldsymbol{I} \right)}^{-1} \left( \boldsymbol{y} + \rho \boldsymbol{z} \right) $.
  • $ \operatorname{Prox}_{\varepsilon g} \left( \boldsymbol{z} \right) = \max \left\{ \min \left\{ \mathcal{S}_{\varepsilon} \left( \boldsymbol{z} \right), \frac{1}{2 \lambda n} \right\}, -\frac{1}{2 \lambda n} \right\} $ where $\mathcal{S}_{\varepsilon} \left( \cdot \right)$ is the element-wise soft-threshold operator, whose output is then clipped to the box constraint.

The code pre-factorizes $\boldsymbol{K} + \rho \boldsymbol{I}$ using a Cholesky decomposition, so each iteration only requires triangular solves.
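As a rough illustration of the scaled-form ADMM iteration (again a minimal NumPy sketch, not the GitHub code linked below; the penalty parameter $\rho$ and the iteration count are arbitrary default choices):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def svr_dual_admm(K, y, eps, lam, rho=1.0, num_iter=200):
    """Minimise 0.5 a' K a - a' y + eps * ||a||_1  s.t. |a_i| <= 1 / (2 lam n),
    using scaled-form ADMM with the split f(x) + g(z), x = z."""
    n = len(y)
    c = 1.0 / (2.0 * lam * n)                   # radius of the box constraint
    chol = cho_factor(K + rho * np.eye(n))      # factorise once, reuse every iteration

    def prox_g(v, thresh):
        # soft threshold (l1 term) followed by projection onto the box
        v = np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
        return np.clip(v, -c, c)

    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                             # scaled dual variable
    for _ in range(num_iter):
        x = cho_solve(chol, y + rho * (z - u))  # x update: (K + rho I) x = y + rho (z - u)
        z = prox_g(x + u, eps / rho)            # z update: prox of (eps / rho) ||.||_1 + box
        u = u + x - z                           # dual update
    return z
```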

Each ADMM iteration is roughly 3 times slower than a proximal gradient iteration, yet ADMM converges in far fewer iterations, so overall it is effectively almost an order of magnitude faster.



The full code is available on my StackExchange Mathematics GitHub Repository (Look at the Mathematics\Q4929444 folder).

  • Thanks for this detailed answer. I realised there were actually a couple of other steps to my question, namely how you arrive at the optimisation problem that I gave in the first place. I restructured the question and asked it on the Cross Validated Stack Exchange. I can re-edit this question if it helps? – oweydd Jul 04 '24 at 10:30
  • @oweydd, I can have a look at the question. But I think my answer solves the one written here. Could you please mark it? – Royi Jul 04 '24 at 19:47
  • Thanks for that. You're quite right! It's been marked. – oweydd Jul 05 '24 at 09:28