
I have tried to get $$\frac{d}{d\vec{x}}\left[\vec{x}^T\vec{x}\right].$$

One approach is to use a component-wise example in 3D: $\begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}\cdot\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = x_1^2 + x_2^2 + x_3^2$.

Differentiating this with respect to the vector $\vec{x}=\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix}$ should give $$\begin{bmatrix}\frac{\partial }{\partial x_1}(x_1^2 + x_2^2 +x_3^2) \\ \frac{\partial}{\partial x_2}(x_1^2 + x_2^2 +x_3^2)\\ \frac{\partial}{\partial x_3}(x_1^2 + x_2^2 +x_3^2)\end{bmatrix}=\begin{bmatrix}2x_1\\2x_2\\2x_3\end{bmatrix}$$

On the other hand, using the product rule: $$\frac{d}{d\vec{x}}\left[\vec{x}^T\vec{x}\right] = \frac{d}{d\vec{x}}\vec{x} + \vec{x}^T \frac{d}{d\vec{x}} = \vec{x}+\vec{x}^T$$ These cannot be added together because they have different dimensionalities. So what did I do wrong? And more importantly, what is the correct derivative of $\vec{x}^T\vec{x}$?

lucidbrot
  • 262

2 Answers


The easiest way is to use the implicit/external definition of the gradient (it can be obtained from the chain rule):

$$d F=dx^T\,\nabla F.$$

EDIT: Explanation of how to obtain the external definition of the gradient. Consider a function $F=F(x_1,\dots,x_n)$. Then the total derivative is given by

$$dF = \dfrac{\partial F}{\partial x_1}dx_1+...+\dfrac{\partial F}{\partial x_n}dx_n=dx_1\dfrac{\partial F}{\partial x_1}+...+dx_n\dfrac{\partial F}{\partial x_n}$$ $$=dx^T\begin{bmatrix}\dfrac{\partial F}{\partial x_1}\\\vdots\\\dfrac{\partial F}{\partial x_n} \end{bmatrix}=dx^T\,\nabla_\text{column} F=\nabla_\text{row}F\,dx $$
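As a sanity check (not part of the derivation), this identity can be verified numerically; the following sketch is my own, using NumPy with $F(x)=x^Tx$ as the test function:

```python
import numpy as np

# Verify dF ≈ dx^T ∇F for F(x) = x^T x, whose gradient is 2x.
def F(x):
    return x @ x

x = np.array([1.0, 2.0, 3.0])
grad_F = 2 * x                            # column gradient of x^T x
dx = 1e-6 * np.array([0.5, -0.3, 0.2])    # small perturbation

dF_actual = F(x + dx) - F(x)              # actual change in F
dF_linear = dx @ grad_F                   # dx^T ∇F

print(dF_actual, dF_linear)               # agree to first order in dx
```

The two numbers differ only by the second-order term $dx^Tdx$, which is negligible for small $dx$.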

What we have to do is determine the total derivative of your expression:

$$d(x^Tx)=dx^T x+x^Tdx.$$

Note that both terms are scalars, hence we can transpose the second one to match the first:

$$d(x^Tx)=dx^T x+dx^Tx=dx^T\left[2x\right]$$

Comparing this expression with the implicit definition of the gradient we obtain

$$\dfrac{d\left(x^Tx\right)}{dx}=\nabla \left[x^Tx \right]=2x.$$


An alternative approach is to calculate the partial derivatives

$$\dfrac{\partial \sum_{j=1}^n x_j^2}{\partial x_i}=\sum_{j=1}^n\dfrac{\partial x_j^2}{\partial x_i}=2x_i$$

and then assemble the gradient as $2x$.
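This componentwise assembly can be mimicked numerically; a small sketch of my own, approximating each partial derivative with a central finite difference:

```python
import numpy as np

# Assemble the gradient of F(x) = sum_j x_j^2 one partial derivative
# at a time, using central finite differences for each component.
def F(x):
    return np.sum(x**2)

x = np.array([1.0, -2.0, 0.5])
h = 1e-6
grad = np.zeros_like(x)
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = 1.0
    grad[i] = (F(x + h * e) - F(x - h * e)) / (2 * h)  # ≈ ∂F/∂x_i = 2*x_i

print(grad)   # matches 2*x componentwise
```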


Or, using index notation (summation over repeated indices):

$$\dfrac{\partial x_jx_j}{\partial x_i}=\dfrac{\partial x_j}{\partial x_i}x_j+x_j\dfrac{\partial x_j}{\partial x_i}=\delta_{ji}x_j+x_j\delta_{ji}=x_i+x_i=2x_i.$$

The symbol $\delta_{ij}=\delta_{ji}$ is the Kronecker delta: it equals $1$ if $i=j$ and $0$ otherwise.
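The index-notation calculation can be written out literally in code; this sketch (my own) spells out the sum $\delta_{ji}x_j + x_j\delta_{ji}$ with an explicit Kronecker delta:

```python
import numpy as np

# Index-notation computation: ∂(x_j x_j)/∂x_i = δ_ji x_j + x_j δ_ji = 2 x_i,
# written out with an explicit Kronecker delta and explicit sums over j.
def kronecker(i, j):
    return 1.0 if i == j else 0.0

x = np.array([1.0, -2.0, 0.5])
n = len(x)

grad = np.array([
    sum(kronecker(j, i) * x[j] + x[j] * kronecker(j, i) for j in range(n))
    for i in range(n)
])

print(grad)   # equals 2*x componentwise
```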

MrYouMath
  • 16,174
  • I couldn't follow where you got the first two expressions from and why it is a valid move to transpose a $dx$. But your explanation using the index notation is very helpful, even though it is what I already tried in my first approach. – lucidbrot Feb 23 '18 at 10:05
  • The first expression is just the external/implicit definition of the gradient. The second expression uses the product rule for the total derivative. We can transpose $x^Tdx$ because it is a scalar (a row vector $\cdot$ a column vector gives a scalar), and you can always transpose a scalar without changing it. Additionally, note that when you transpose $x^Tdx$ you have to reverse the order of the product: $(dx)^T(x^T)^T=dx^Tx$. – MrYouMath Feb 23 '18 at 10:20
  • Oh, now the second equation is obvious, thanks. Regarding the transposing, I guess the fact that we can treat $dx$ like a normal variable is a topic in itself (?) Do you happen to know a link where I can read up on the first definition? Googling for "implicit definition of gradient" did not yield much useful for me. – lucidbrot Feb 23 '18 at 10:32
  • Added the explanation for the external/implicit definition of the gradient – MrYouMath Feb 23 '18 at 10:43
  • I'm having some trouble with the dimensions. Where did you get the first line? Why is it $$dF = dx^T \nabla F$$ instead of just $$dF = dx \nabla F$$? From my understanding, it is just a rearrangement of the original derivative, which was asked in terms of x – information_interchange Feb 03 '19 at 20:42
  • @information_interchange: It depends on the definition of the gradient (in the first equation it is defined as a column vector). The second equation does not make sense without further explanation: $dF$ is a scalar (hence $1\times 1$), $dx$ is a column vector from $\mathbb{R}^n$ (hence $dx$ has format $n\times 1$), and the gradient has format either $n \times 1$ (not possible here, because the matrix product $dx\,\nabla F$ would not be defined) or $1 \times n$ (possible, but it would give a $dF$ of format $n \times n$, which contradicts the scalar property). – MrYouMath Feb 04 '19 at 20:13

What you did first is correct: you cannot simply generalize every theorem of scalar analysis to vector analysis. The derivative of a scalar with respect to a vector is defined as the vector whose entries are the derivatives with respect to the vector's entries. From this definition you can conclude that $$\frac{d(x^TAx)}{dx}=(A+A^T)x$$ for an arbitrary matrix $A$. Now set $A=I$ to obtain the desired result.
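A quick numerical sanity check of this formula (my own sketch, using NumPy and central finite differences on a random matrix):

```python
import numpy as np

# Check d(x^T A x)/dx = (A + A^T) x numerically for a random A.
# With A = I this reduces to the gradient 2x from the question.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f(v):
    return v @ A @ v

h = 1e-6
grad_fd = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)   # finite-difference ∂f/∂x_i
    for e in np.eye(n)
])

grad_formula = (A + A.T) @ x
print(np.max(np.abs(grad_fd - grad_formula)))  # tiny residual
```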

user528935
  • 161
  • Thanks! There's one thing left confusing me though: we could get $(A+A^T)x$ as well as $x^T(A+A^T)$, right? With $A=I$, both mean multiplying every element of $x$ by $2$. But $Ix$ would give a column vector while $x^TI$ gives a row vector. How can those be the same? – lucidbrot Feb 23 '18 at 09:57
  • Indeed, there is no rule that the derivative of a scalar with respect to a vector must be a column vector rather than a row vector; this depends on the author's convention. For instance, F. Lewis defines it as a column vector in his book Optimal Control, but H. Khalil defines it as a row vector in his book Nonlinear Systems. Once you accept one of these definitions, you should use it consistently when deriving your relations. So both $(A+A^T)x$ and $x^T(A+A^T)$ are correct. Here I have used the column convention for the derivative. – user528935 Feb 23 '18 at 17:12