
Question

Define the function $f: (0, \infty) \to \mathbb{R}$ by $$f(c) = \min_{x \in \mathbb{R}^n \, : \, \|x\| = c} \|b - A x\|_2^2,$$ for $A \in \mathbb{R}^{m \times n}$ with full rank, $b \in \mathbb{R}^m$, and $\|\cdot\|$ some norm. How do I show that $f(c)$ is differentiable? Is it possible to show that $f''(c) \ge 0$ for $c$ less than $\|\hat{x}\|$, to be defined later?

Thoughts toward solution

Plug into derivative

Plugging this function into the definition of the derivative isn't illuminating.
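A numerical probe is more suggestive, though. Below is a minimal sketch (assuming random data and $\|\cdot\| = \|\cdot\|_1$; the sphere constraint is nonconvex and nonsmooth, so the multi-start SLSQP call is only a heuristic) that evaluates $f$ on a grid and looks for jumps in the finite-difference slopes:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))   # full rank with probability one
b = rng.standard_normal(m)

def f(c):
    """Numerically evaluate f(c) = min_{||x||_1 = c} ||b - A x||_2^2."""
    obj = lambda x: np.sum((b - A @ x) ** 2)
    cons = {"type": "eq", "fun": lambda x: np.linalg.norm(x, 1) - c}
    best = np.inf
    for _ in range(20):               # multi-start: the sphere is nonconvex
        x0 = rng.standard_normal(n)
        x0 *= c / np.linalg.norm(x0, 1)
        res = minimize(obj, x0, method="SLSQP", constraints=[cons])
        if res.success:
            best = min(best, res.fun)
    return best

cs = np.linspace(0.1, 3.0, 30)
vals = [f(c) for c in cs]
print(np.diff(vals) / np.diff(cs))    # jumps here would suggest kinks in f
```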

"Geometric" interpretation

There is a clear geometric interpretation of this problem, since $$f(c) = \|b - A \hat{x} \|_2^2 + \min_{\|x\| = c} (x - \hat{x})^T (A^T A) (x - \hat{x}),$$ for $\hat{x} = (A^TA)^{+}A^T b \in \arg\min_{x \in \mathbb{R}^n} \|b - A x\|_2^2$ and $\left( \cdot \right)^+$ the pseudo-inverse. (The cross term vanishes because $A^T(b - A\hat{x}) = 0$.) Thus, we need only consider the function \begin{align*} g(c) & = \min_{\|x\| = c} (x - \hat{x})^T (A^T A) (x - \hat{x}) \\ & = c^2 \min_{\|x\| = 1} (x - c^{-1} \hat{x})^T (A^T A) (x - c^{-1} \hat{x}) \\ & = c^2 \left( \mathbf{d}(c^{-1} \hat{x}, \Omega) \right)^2, \end{align*} where $\mathbf{d}(x,y) = \|A(x-y)\|_2$ is a pseudo-metric (and a metric if $A$ is "skinny," i.e. has full column rank) and $\Omega = \{x \in \mathbb{R}^n \, : \, \|x\|=1 \}$. Notice that $g$ is differentiable if and only if the function $$c \mapsto \mathbf{d}^2(c \hat{x}, \Omega) \tag{*}$$ is differentiable for $c \in (0, \infty)$ and a fixed point $\hat{x}$.
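As a quick sanity check on this decomposition (not a proof; a sketch with random data, $\|\cdot\| = \|\cdot\|_1$, and the same multi-start heuristic as above):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
xhat = np.linalg.pinv(A) @ b            # (A^T A)^+ A^T b
resid = np.sum((b - A @ xhat) ** 2)     # ||b - A xhat||_2^2

def sphere_min(obj, c, tries=20):
    """Minimize obj over the l1 sphere of radius c (multi-start SLSQP)."""
    cons = {"type": "eq", "fun": lambda x: np.linalg.norm(x, 1) - c}
    best = np.inf
    for _ in range(tries):
        x0 = rng.standard_normal(n)
        x0 *= c / np.linalg.norm(x0, 1)
        res = minimize(obj, x0, method="SLSQP", constraints=[cons])
        if res.success:
            best = min(best, res.fun)
    return best

c = 0.7
f_c = sphere_min(lambda x: np.sum((b - A @ x) ** 2), c)
g_c = sphere_min(lambda x: (x - xhat) @ A.T @ A @ (x - xhat), c)
print(f_c, resid + g_c)                 # should agree up to solver tolerance
```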

A very special case as an example: note that if $A^T A = I$, $\|\cdot\| = \|\cdot\|_1,$ and $|\hat{x}_j| = |\hat{x}_k|$ for all $j,k$, then the projection satisfies $\Pi_\Omega(c\hat{x}) = \mathrm{sgn}(\hat{x})/n$ for every $c > 0$, so that $\mathbf{d}^2(c\hat{x}, \Omega) = \frac{\left( c \|\hat{x}\|_1 - 1 \right)^2}{n}$, which is differentiable in $c$ and has positive second derivative $2\|\hat{x}\|_1^2/n$.
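A minimal numerical check of this special case (a sketch; since $A^TA = I$, the pseudo-metric $\mathbf{d}$ is just the Euclidean distance):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 4
xhat = 0.3 * np.sign(rng.standard_normal(n))   # equal-magnitude components
c = 2.0
obj = lambda y: np.sum((c * xhat - y) ** 2)    # A^T A = I: Euclidean distance
cons = {"type": "eq", "fun": lambda y: np.linalg.norm(y, 1) - 1.0}
best = np.inf
for _ in range(30):                            # multi-start over the l1 sphere
    y0 = rng.standard_normal(n)
    y0 /= np.linalg.norm(y0, 1)
    res = minimize(obj, y0, method="SLSQP", constraints=[cons])
    if res.success:
        best = min(best, res.fun)
print(best, (c * np.abs(xhat).sum() - 1) ** 2 / n)   # should agree
```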

Later, I add this more general example:

We will reexpress $\hat{x}$ in a basis where each basis vector is orthogonal to a face of the $\ell_1$ norm level set. Assume that $\|c \hat{x} \|_1 > 1$ so that the point $c \hat{x}$ is outside of the unit ball. Assume without loss of generality that $\hat{x}_j > 0$. Then, since now each component of $c\hat{x}$ measures its distance from a face, we have that $d(c \hat{x}, \Omega) = \left[ \sum_{j=1}^n (c \hat{x}_j - 1)^2 \mathbf{1}_{c > d_j} \right]^{1/2},$ where $\{d_j\}$ is the set of knots of $\mathbf{proj}_\Omega(c\hat{x})$. Therefore, the distance $d$ has a derivative which increases across each knot, so that it is convex.
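This can be tested numerically. The sketch below uses the exact Euclidean projection onto the $\ell_1$ ball (the algorithm of Duchi et al., 2008); since $c\hat{x}$ stays outside the unit ball, the projection onto the ball lands on the sphere $\Omega$, so the computed distance is exact, and the finite-difference slopes of the squared distance should be nondecreasing:

```python
import numpy as np

def proj_l1_ball(v, radius=1.0):
    """Euclidean projection onto {x : ||x||_1 <= radius} (Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(2)
xhat = np.abs(rng.standard_normal(4))        # WLOG positive components
cs = np.linspace(1.5 / xhat.sum(), 6.0, 400) # keep c*xhat outside the unit ball
d2 = [np.sum((c * xhat - proj_l1_ball(c * xhat)) ** 2) for c in cs]
slopes = np.diff(d2) / np.diff(cs)
print(np.all(np.diff(slopes) > -1e-9))       # nondecreasing slopes => convex
```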

Perhaps this is a convex program on part of its domain?

I also think that it could be true that $f(c)$ is decreasing on $(0, \|\hat{x}\|)$, so that $$f(c) \stackrel{?}{=} \min_{x \in \mathbb{R}^n \, : \, \|x\| \leq c} \|b - A x\|_2^2,$$ for $c \in (0, \|\hat{x}\|)$. This would make it a convex program and hence more amenable to analysis. I suspect this because, as $c$ increases within $(0, \|\hat{x}\|)$, the value $f(c)$ should get "closer" to the unconstrained minimum $\|b - A \hat{x}\|_2^2$.
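This conjecture is easy to test numerically. A sketch (random data, $\|\cdot\| = \|\cdot\|_1$; note that scipy's "ineq" convention requires the constraint function to be nonnegative) comparing the equality- and inequality-constrained values for $c < \|\hat{x}\|_1$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
xhat = np.linalg.pinv(A) @ b
obj = lambda x: np.sum((b - A @ x) ** 2)

def solve(c, kind):
    if kind == "eq":
        cons = {"type": "eq", "fun": lambda x: np.linalg.norm(x, 1) - c}
    else:  # scipy convention: "ineq" constraints require fun(x) >= 0
        cons = {"type": "ineq", "fun": lambda x: c - np.linalg.norm(x, 1)}
    best = np.inf
    for _ in range(20):
        x0 = rng.standard_normal(n)
        x0 *= c / np.linalg.norm(x0, 1)   # feasible start on the sphere
        res = minimize(obj, x0, method="SLSQP", constraints=[cons])
        if res.success:
            best = min(best, res.fun)
    return best

# conjecture: the two values coincide for c below ||xhat||_1
for c in np.linspace(0.2, 0.9, 4) * np.linalg.norm(xhat, 1):
    print(c, solve(c, "eq"), solve(c, "ineq"))
```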

Further comments on question

If possible, it would be interesting to know how general the norm $\|\cdot\|$ can be while the result remains provable. If it helps to simplify the problem, I'm particularly interested in the case $\|x\| = \|x\|_1$. As Rodrigo kindly pointed out, the case $\|x\| = \|x\|_2$ follows from noticing that the "ridge regression" estimator gives $f(c)$ in closed form.
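For the $\ell_2$ case, here is a sketch of the ridge-path computation Rodrigo alluded to: for $0 < c < \|\hat{x}\|_2$, the constrained minimizer is $x(\lambda) = (A^TA + \lambda I)^{-1}A^Tb$ with $\lambda \geq 0$ chosen so that $\|x(\lambda)\|_2 = c$; since $\lambda \mapsto \|x(\lambda)\|_2$ is decreasing, bisection finds it. (Assumes $A$ has full column rank so the linear solves are well posed.)

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
xhat = np.linalg.lstsq(A, b, rcond=None)[0]

def ridge_x(lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

def f_l2(c):
    """f(c) for the l2 norm via the ridge path, valid for 0 < c < ||xhat||_2."""
    lo, hi = 0.0, 1.0
    while np.linalg.norm(ridge_x(hi)) > c:   # bracket: ||x(lam)|| decreases in lam
        hi *= 2.0
    for _ in range(100):                      # bisection on lam
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(ridge_x(mid)) > c:
            lo = mid
        else:
            hi = mid
    x = ridge_x(0.5 * (lo + hi))
    return np.sum((b - A @ x) ** 2)

cs = np.linspace(0.1, 0.95, 10) * np.linalg.norm(xhat)
print([f_l2(c) for c in cs])   # smooth and decreasing on (0, ||xhat||_2)
```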

1 Answer


Let us rewrite your optimization problem as $$\min_{x\in \Omega} \|b-cAx\|_2^2$$ where $\Omega = \{x \in \mathbb{R}^n \, : \, \|x\| = 1\}$ is a compact set independent of $c$ (substitute $x \mapsto cx$). Let us suppose that $b = Ak$ for some point $k$. Then $$\min_{x\in \Omega} \|b-cAx\|_2^2 = c^2 \, d(k/c, \Omega)^2$$ where $d(\cdot,\cdot)$ is the distance in $\mathbb{R}^n$ with respect to the metric $A^TA$; this distance is differentiable away from the cut locus of $\Omega$, and $f(c)$ is differentiable away from $c=0$ and values of $c$ for which $k/c$ lies on the cut locus of $\Omega$.
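A quick numerical sanity check of this rewriting (a sketch with random data and $\|\cdot\| = \|\cdot\|_1$; the two problems are related by the substitution $x = cy$, so the values must agree up to solver tolerance):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
m, n = 8, 3
A = rng.standard_normal((m, n))
k = rng.standard_normal(n)
b = A @ k                      # the assumption b = A k above

def sphere_min(obj, c, tries=20):
    """Minimize obj over the l1 sphere of radius c (multi-start SLSQP)."""
    cons = {"type": "eq", "fun": lambda x: np.linalg.norm(x, 1) - c}
    best = np.inf
    for _ in range(tries):
        x0 = rng.standard_normal(n)
        x0 *= c / np.linalg.norm(x0, 1)
        res = minimize(obj, x0, method="SLSQP", constraints=[cons])
        if res.success:
            best = min(best, res.fun)
    return best

c = 1.3
lhs = sphere_min(lambda x: np.sum((b - A @ x) ** 2), c)
rhs = c**2 * sphere_min(lambda y: np.sum((A @ (k / c - y)) ** 2), 1.0)
print(lhs, rhs)                # should agree up to solver tolerance
```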

This answer is not complete since a few details need to be worked out:

  • handling the case where $A$ is not surjective (I don't think this is a major issue);
  • if $k/c$ lies on the cut locus, it may still be the case that $f(c)$ is differentiable even though $d$ is not. This happens, for instance, when $\Omega$ is the polyhedron $\{x \in \mathbb{R}^n\ \mathrm{s.t.}\ \|x\|_1 = 1\}$, $A=I$, and $b$ lies on one of the axes (see the numerical sketch after this list);

but it proves differentiability in the generic case and hopefully is enough to get you started.
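To make the second caveat concrete: with $A = I$, $b = e_1$, and the $\ell_1$ norm, projecting onto the faces of the sphere gives $f(c) = (1-c)^2$ for $c \leq 1$ and $f(c) = (c-1)^2/n$ for $c \geq 1$; both pieces have vanishing derivative at $c = 1$, so $f$ is differentiable there even though the distance to $\Omega$ is not differentiable at the cut locus. A numerical sketch checking this closed form (multi-start SLSQP again, so a heuristic):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 3
b = np.zeros(n); b[0] = 1.0                      # b on an axis, A = I
obj = lambda x: np.sum((b - x) ** 2)

def f(c, tries=30):
    cons = {"type": "eq", "fun": lambda x: np.linalg.norm(x, 1) - c}
    best = np.inf
    for _ in range(tries):
        x0 = rng.standard_normal(n)
        x0 *= c / np.linalg.norm(x0, 1)
        res = minimize(obj, x0, method="SLSQP", constraints=[cons])
        if res.success:
            best = min(best, res.fun)
    return best

cs = np.linspace(0.5, 1.5, 41)                   # suspect point: c = ||b||_1 = 1
vals = np.array([f(c) for c in cs])
exact = np.where(cs <= 1.0, (1.0 - cs) ** 2, (cs - 1.0) ** 2 / n)
print(np.max(np.abs(vals - exact)))              # small => matches the closed form
print(np.diff(vals) / np.diff(cs))               # slopes pass smoothly through 0
```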

user7530
  • Thank you for the help! I've edited my question to make some things clearer. We've considered the same construction, but you've taken it further. (See my "geometric interpretation" section.) Notice in particular that I discuss that this construction works even if $A$ is not surjective. I don't understand your comments related to the "cut locus". Could you clarify what this means? In particular, could you clarify what it means in the context of showing the differentiability of the (*)'d function in my question? – user795305 Jul 25 '17 at 16:36
  • Like you mention, when the $\|\cdot\|$ sphere is smooth, the projection onto it varies smoothly, and the (pseudo-)distance is differentiable. However, what approach is possible when the projection has finitely many points of nondifferentiability, as in the case $\|\cdot\| = \|\cdot\|_1$? – user795305 Jul 25 '17 at 16:44
  • @Ben The cut locus is, roughly speaking, the set of points in $\mathbb{R}^n$ that have multiple closest points on $\Omega$. The distance function generally is not differentiable on the cut locus, and if $k/c$ crosses it there's a possibility of the minimizer $x$ of $f$ jumping to a different point on $\Omega$. I don't think this possibility can be ruled out for general $\Omega$. I'd have to think more about the specific case of $\|\cdot\|_1$. – user7530 Jul 25 '17 at 17:48