8

I am encountering a problem concerning Reproducing Kernel Hilbert Spaces (RKHS) in the context of machine learning using Support Vector Machines (SVMs).

With reference to this paper [Olivier Chapelle, 2006], Section 3, I will try to be brief and focused on my problem, so I may omit a rigorous description of the objects I use below.

Consider the following optimization problem: $$ \displaystyle \min_{\mathbf{w},b}\: \lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}L(y_i, \mathbf{w}\cdot\mathbf{x}_i+b), $$ where $L(y,t)=\max(0,1-yt)$ is the so-called "hinge loss". To introduce kernels, and thus consider non-linear SVMs, the author reformulates the above problem as a search for a function in an RKHS, $\mathcal{H}$, that minimizes the following functional: $$ F[f]=\lambda\lVert f \rVert^2_\mathcal{H} + \sum_{i=1}^{n}L(y_i, f(\mathbf{x}_i)+b). $$

I follow this part of his work; my question is the following. What if, instead of the hinge loss above, I had a loss function whose argument is not expressed solely through the inner product $\mathbf{w}\cdot\mathbf{x}_i$ (which, if I understand correctly, is "replaced" by $f(\mathbf{x}_i)$), but is of the form $$ \mathbf{w}\cdot\mathbf{x}_i+b+\sqrt{\mathbf{w}^TA\mathbf{w}}, $$ where $A$ is a symmetric positive-definite matrix? That is, is there any way of expressing the term $\sqrt{\mathbf{w}^TA\mathbf{w}}$ through the function $f$, so that I can still state my optimization problem in the RKHS setting?

On the other hand, the theory suggests that, whatever the loss function $L$ is, the solution of the reformulated problem above has the form $$ f(\mathbf{x})=\sum_{i=1}^{n}\alpha_ik(\mathbf{x}_i, \mathbf{x}), $$ where $k$ is the kernel associated with the adopted RKHS. Have I understood that correctly? Would the solution still have this form even if my loss function included terms like $\sqrt{\mathbf{w}^TA\mathbf{w}}$?

Næreen
  • 316
nullgeppetto
  • 3,146

2 Answers

5

Beginning with the answer to your second question, suppose that $f \in H$, where $H$ is the reproducing kernel Hilbert space. Let $S$ be the subspace spanned by the kernel functions $k(x_i, \cdot)$. Then, by the theory of Hilbert spaces, $f$ can be written as $f = f_S + f_P$, where $$f_S(x) = \sum_{i=1}^n a_i k(x_i, x)$$ and $f_P$ is orthogonal to $S$. Moreover, by the Pythagorean theorem, $$\| f \|^2 = \| f_S \|^2 + \| f_P \|^2.$$ In particular, this tells us that $\|f\| > \|f_S\|$ whenever $f_P \neq 0$.

Now consider $f(x_i)$, which by the reproducing property can be written as $$f(x_i)=\langle f, k(x_i, \cdot) \rangle = \langle f_S, k(x_i, \cdot) \rangle + \langle f_P, k(x_i, \cdot) \rangle = \langle f_S, k(x_i, \cdot) \rangle + 0 = f_S(x_i).$$

Thus for every $f$ we have $$\sum_{i=1}^n L(y_i, f(x_i) + b) = \sum_{i=1}^n L(y_i, f_S(x_i) + b).$$

Hence, $$F[f] = \lambda \| f\|^2 + \sum_{i=1}^n L(y_i, f(x_i) + b) \ge \lambda \| f_S\|^2 + \sum_{i=1}^n L(y_i, f_S(x_i) + b) = F[f_S],$$ with strict inequality whenever $f_P \neq 0$, and this holds for every $f \in H$. This means that any minimizer of $F$ must lie in the subspace $S$, i.e. it is a linear combination of kernel functions.
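
To make this consequence concrete, here is a minimal numerical sketch (my own illustration, not the paper's algorithm): since any minimizer lies in $S$, one can parameterize $f$ by its coefficients $\alpha$ and minimize over $\alpha$ and the bias $b$ directly, using $\|f_S\|^2_{\mathcal{H}} = \alpha^T K \alpha$. The Gaussian kernel, the toy data, and SciPy's derivative-free Powell method are all illustrative choices.

```python
# Minimal sketch: the representer theorem reduces minimizing F over the RKHS
# to a finite-dimensional problem over the coefficients alpha and the bias b.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                      # toy inputs
y = np.sign(X[:, 0] + X[:, 1])                    # toy labels in {-1, +1}
lam = 0.1                                         # regularization weight lambda

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = gaussian_kernel(X, X)                         # Gram matrix K_ij = k(x_i, x_j)

def objective(theta):
    alpha, b = theta[:-1], theta[-1]
    f_vals = K @ alpha                            # f_S(x_i) = sum_j alpha_j k(x_j, x_i)
    hinge = np.maximum(0.0, 1.0 - y * (f_vals + b))
    return lam * alpha @ K @ alpha + hinge.sum()  # lam * ||f_S||_H^2 + sum of losses

# The hinge loss is non-smooth, so a derivative-free method is used here.
res = minimize(objective, np.zeros(len(y) + 1), method="Powell")
alpha_opt, b_opt = res.x[:-1], res.x[-1]
```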


As for the first question, quadratic terms resembling $w^T A w$ appear through what is known as the Gram matrix, which is built from the kernel: $$K = \left( k(x_i,x_j) \right)_{i,j=1}^n.$$ It is straightforward to prove that this matrix is symmetric and positive semi-definite, since if $a = (a_1, a_2, \ldots, a_n)$ then $$a^T K a = \left\langle \sum_{i=1}^n a_i k(x_i, \cdot), \sum_{j=1}^n a_j k(x_j, \cdot)\right\rangle=\left\|\sum_{i=1}^n a_i k(x_i, \cdot)\right\|^2 \ge 0.$$
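
As a quick sanity check of this property (an illustrative sketch, using the linear kernel $k(x, x') = x \cdot x'$ for simplicity):

```python
# Sketch: numerically check that a Gram matrix is symmetric and PSD,
# here for the linear kernel k(x, x') = x . x'.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))                   # 10 toy points in R^3

K = X @ X.T                                    # Gram matrix K_ij = k(x_i, x_j)
print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # no significantly negative eigenvalue
```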

This gives us a first hint at how to recast $w^T A w$ in the language of reproducing kernel Hilbert spaces.


Take, for instance, $$A = \operatorname{diag}(a_1,a_2,a_3,\ldots, a_n),$$ where each $a_i > 0$. Then $$w^T A w = \sum_{i=1}^n a_i w_i^2.$$

Now imagine replacing $w$ with $f$, identifying each coordinate $w_i$ with the evaluation $f(x_i)$. Then $$\sum_{i=1}^n a_i w_i^2 = \sum_{i=1}^n a_i f(x_i)^2.$$

By the same reasoning as above, $$\sum_{i=1}^n a_i f(x_i)^2 = \sum_{i=1}^n a_i f_S(x_i)^2,$$ so we may add this term to the loss function and still be guaranteed that a minimizer is a linear combination of kernel functions.

So, in short, you may introduce the term you want into your loss function, keeping in mind the identification $w = (f(x_1), f(x_2),\ldots,f(x_n))$.
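
As a concrete illustration of that identification (a sketch based on this answer's assumption $w_i = f_S(x_i) = (K\alpha)_i$, with randomly generated toy data rather than anything from the paper), the penalty $w^T A w$ becomes the quadratic form $\alpha^T K A K \alpha$ in the coefficient vector:

```python
# Sketch: with w_i identified with f_S(x_i) = (K alpha)_i, the penalty
# w^T A w becomes (K alpha)^T A (K alpha), a quadratic form in alpha.
import numpy as np

rng = np.random.default_rng(2)
n = 8
X = rng.normal(size=(n, 2))
K = X @ X.T                                  # Gram matrix of the linear kernel

B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)                  # a symmetric positive-definite A

alpha = rng.normal(size=n)
f_vals = K @ alpha                           # (f_S(x_1), ..., f_S(x_n))

penalty_direct = f_vals @ A @ f_vals         # w^T A w with w = f_vals
penalty_alpha = alpha @ (K @ A @ K) @ alpha  # same value as a quadratic form in alpha
print(np.isclose(penalty_direct, penalty_alpha))
```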

Joel
  • 16,574
  • I am not capable of checking its correctness, but it seems very helpful for sure! I will try to understand it, and -if I cannot do so- I will come back for more info! Thank you very much! If someone else could verify your response, or even give some other perspective, that would be nice! – nullgeppetto Jun 19 '14 at 20:05
  • 1
    There are definitely bits that need some work in here. If you have any specific questions, please feel free to ask. I plan on adding a little bit here too. – Joel Jun 19 '14 at 20:07
  • Thanks a lot! I will be waiting for that! – nullgeppetto Jun 19 '14 at 20:08
  • Thanks for the edit. In fact, I had been thinking about the quadratic, and I found out that, in the case that the matrix $A$ is diagonal, I could possibly express it somehow using the norm, as you wrote above. But what if the matrix $A$ is not diagonal, but a generic symmetric, positive-definite matrix, instead? Does your approach apply to this case too? – nullgeppetto Jun 19 '14 at 20:24
  • 1
    Yes it will work for non-diagonal matrices. All we really need is the identification $f(x_i) = f_S(x_i)$. – Joel Jun 19 '14 at 20:26
  • Thanks once again! I will return, if necessary! – nullgeppetto Jun 19 '14 at 20:27
  • 1
    If you want you can write $$w^T A w = \sum_{i,j=1}^n a_{ij} w_i w_j = \sum_{i,j=1}^n a_{ij} f(x_i) f(x_j) = \sum_{i,j=1}^n a_{ij} f_S(x_i) f_S(x_j)$$ – Joel Jun 19 '14 at 20:28
  • I have added an example below. Could you take a look? Thanks a lot, anyway! – nullgeppetto Jun 29 '14 at 09:59
  • are you there? :) – nullgeppetto Jun 30 '14 at 15:27
  • Yes I have seen your comment. I have been a little too busy the past few days. Might take me until the end of the week to look it over seriously. – Joel Jun 30 '14 at 17:00
  • Thanks a lot, Joel! Just two quick questions (if you have the time), and I will re-write my 2nd post correctly. First, if $f(x)=\sum_{i=1}^{l}a_i k(x_i,x)+b$, what would be the square of the norm of $f$? Could it be $\mathbf{a}^TK\mathbf{a}+b^2$, or what? Second, why is $w_j=f(\mathbf{x}_j)$? Would it be different if our solution function had a bias term $b$? Many thanks for your time! – nullgeppetto Jun 30 '14 at 18:21
  • 1
    For the first question, $$|f|^2 = \langle f, f \rangle = \sum_{i,j=1}^l a_i a_j k(x_i,x_j) + 2 \sum_{i=1}^l a_i \langle b, k(x_i, x) \rangle + b^2 = a^T K a + 2 b \sum_{i=1}^l a_i + b^2$$ but we also know that $$|f| \le \sqrt{a^T K a} + |b|$$ by the triangle inequality. – Joel Jun 30 '14 at 18:29
  • 1
    $w_j = f(x_j)$ since we are now thinking of this as the evaluation of a function from an infinite dimensional Hilbert space. – Joel Jun 30 '14 at 18:31
  • Thanks a lot, Joel! I'll try to sum all this info up and rephrase my second post! Not forgetting to acknowledge your precious help, of course! – nullgeppetto Jun 30 '14 at 18:33
  • I am glad I could help :) – Joel Jun 30 '14 at 18:37
  • Dear @Joel, I have posted another question concerning the proof of a (semi-parametric) representer theorem. I need your help! It seems that you are the only person here who is that familiar with this stuff, or maybe the others just ignore me! So, here is the question, if you have some time; I need just the first step of the proof: http://math.stackexchange.com/questions/855751/how-to-prove-the-semi-parametric-representer-theorem (Thanks!) – nullgeppetto Jul 04 '14 at 05:42
  • I will give it some thought. Don't know the solution off the top of my head. I will be out for a few days though. – Joel Jul 05 '14 at 04:53
  • Thanks @Joel! I am stuck at the very first step of the proof. I mean, in the first (non-parametric) theorem they just define a $\phi(\cdot)$ to start the proof. So, what would be my first step? Any time you can, please give it some thought! Thanks a lot! I'm thinking about starting a bounty, because I really need the proof quickly! – nullgeppetto Jul 05 '14 at 08:26
  • Just to inform you that I have started a bounty (unfortunately, just for 50 reps, as I'm a poor man!) for this question... I think that maybe you're the only one that could help... – nullgeppetto Jul 07 '14 at 11:21
  • Dear @Joel, I have to return to an older question of mine; 7 comments above, you say that $w_j=f(x_j)$. I really don't get it, and to tell the truth it doesn't seem to work in my problem either... We have agreed that $\mathbf{w}\cdot\mathbf{x}_i=f(\mathbf{x}_i)$... So, how can it also hold that $\mathbf{w}=(f(\mathbf{x}_1),\ldots,f(\mathbf{x}_n))$? Moreover, we have $l$ inputs $\mathbf{x}_i$... I really need some help! The question actually concerns the expression of the quadratic $\mathbf{w}^TA\mathbf{w}$, where $\mathbf{w}\in\mathbb{R}^n$, and $A$ is a symmetric pd $n\times n$ matrix... – nullgeppetto Jul 07 '14 at 14:27
  • Doesn't my last comment make sense? There must be an error, if the following hold simultaneously: $\mathbf{w}\cdot\mathbf{x}_i=f(\mathbf{x}_i)$ and $\mathbf{w}=(f(\mathbf{x}_1),\ldots,f(\mathbf{x}_n))^T\in\mathbb{R}^n$... – nullgeppetto Jul 07 '14 at 20:38
  • Hi @Joel, sorry for being that annoying, but is there a chance for you to give some thought to my question? As you know, it concerns a tiny (ok, not so tiny) detail, as you have solved the rest. It would be nice if you could help me once more... Please let me know what your intention is! Thanks a lot! – nullgeppetto Jul 12 '14 at 16:31
  • Even though not all my questions have been answered, I believe that @Joel deserves the award! If Joel wishes, he can put some additional thought into the details that remain unanswered! – nullgeppetto Jul 15 '14 at 12:41
  • Thanks @nullgeppetto :) I will keep thinking about it. Thing is, I don't really work in that field, but I will let you know if I come up with anything. – Joel Jul 15 '14 at 13:48
  • Thanks @Joel! The truth is that I believe that the quadratic form cannot be expressed in general using RKHS... If you have any ideas sometime, please let me know! – nullgeppetto Jul 15 '14 at 13:58
  • Hi @Joel! I need your help with something! If you have time, of course! Could you explain the substitution $\lVert\mathbf{w}\rVert^2_2 \leftrightarrow \lVert f \rVert^2_{\mathcal{H}}$? Moreover, could you have a look at the main question that remains unanswered (about the quadratic form and its expression via $f$)? Thanks a lot! – nullgeppetto Aug 16 '14 at 13:54
2

I would like to clarify my final question, as described in the discussion with @Joel above (see the comments).

Let $\mathbf{w}=(w_1,\ldots,w_n)^T$, $\mathbf{x}_i=(x_{i1},\ldots,x_{in})^T\in\mathbb{R}^n$, $i=1,\ldots,m$, and $A=\big(a_{ij}\big)_{i,j=1}^{n}$ an $n\times n$ symmetric positive definite real matrix.

Let's suppose that we would like to minimize the following quantity with respect to $\mathbf{w}$:

$$ J =\mathbf{w}\cdot\mathbf{w} + \sum_{i=1}^{m}\mathbf{w}\cdot\mathbf{x}_i + \mathbf{w}^TA\mathbf{w}. $$

Instead of solving the above optimization problem directly, we choose to look for a function $f$ that minimizes a functional, in such a way that the problem remains equivalent to the first one. Let this function belong to a Reproducing Kernel Hilbert Space $\mathcal{H}$. The appropriate functional should be of the form $$ \Phi[f]= \big\lVert f \big\rVert^2_{\mathcal{H}} + \sum_{i=1}^{m}f(\mathbf{x}_i) + \cdots, $$ but I do not know how to express the quadratic form $\mathbf{w}^TA\mathbf{w}$ in terms of $f$. Could you help?

What I have thought so far is as follows. We have "replaced" the quantity $\mathbf{w}\cdot\mathbf{w}$ by the norm $\big\lVert f \big\rVert^2_{\mathcal{H}}$, so we could probably write $$ \mathbf{w}^TA\mathbf{w} = \mathbf{w}^T\big(LDL^T\big)\mathbf{w} = \mathbf{w}^T \big(LD^{1/2}\big)\big(LD^{1/2}\big)^T \mathbf{w} = \mathbf{w'}\cdot\mathbf{w'}, $$ where $\mathbf{w'}=\big(LD^{1/2}\big)^T\mathbf{w}$. Could we then find the corresponding norm that should be used instead?
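
As a numerical sanity check of this factorization idea (my own sketch, using NumPy's Cholesky factor $C = LD^{1/2}$, so that $A = CC^T$, on randomly generated data):

```python
# Sketch: w^T A w equals ||w'||^2 with w' = C^T w, where A = C C^T
# (C plays the role of L D^{1/2} in the LDL^T notation above).
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)        # a symmetric positive-definite A
w = rng.normal(size=n)

C = np.linalg.cholesky(A)          # lower-triangular, A = C C^T
w_prime = C.T @ w
print(np.isclose(w @ A @ w, w_prime @ w_prime))
```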

nullgeppetto
  • 3,146
  • Hey Geppetto, I have been busy with my own research, and I haven't had time to look your question over. You should ask another question on math.se instead of posting another one as an answer. It will get more attention. Also answering a question posted as an answer is a little awkward. – Joel Jul 13 '14 at 16:20
  • Thanks Joel for responding! Actually, I did ask my second post elsewhere as a standalone question, but I had no luck... You're right, this is awkward, but I'm planning to fix it as soon as it's answered. I will include it in my original post, acknowledging whoever answers, hopefully you! I know you have your own research and, of course, you are by no means obligated to answer my questions, but it seems that you are possibly the only one who understands what I want, or at least the only one who really cared. Did you notice what the detail is? It seems that there is a contradiction... – nullgeppetto Jul 13 '14 at 16:31