3

I am trying to calculate the following gradient

$$\nabla_{\mathbf{X}} \left( \mathbf{a}^{T} \mathbf{X} \mathbf{a} \right)$$

where I am using the convention that $\mathbf{a}$ is a column vector. I am wondering what the steps are to arrive at the solution given in the Matrix Cookbook, which is:

$$\nabla_{\mathbf{X}} \left( \mathbf{a}^{T} \mathbf{X} \mathbf{a} \right) = \mathbf{a}\cdot\mathbf{a}^{T}$$

2 Answers

5

See this question for the basics and the notation.

The derivative of the scalar function $f(X)$ with respect to $X$, where $X$ is a matrix, is the matrix $A$ with $A_{i,j}=\dfrac{df(X)}{dX_{i,j}}$.

And here,

$$f(X)=a^TXa=\sum_{i,j} X_{i,j}a_ia_j$$

So that

$$\dfrac{df(X)}{dX_{i,j}}=a_ia_j$$

And finally

$$A=\frac{df(X)}{dX}=aa^T$$
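
One way to sanity-check this elementwise derivation numerically is a short NumPy sketch along the following lines (the helper names `f`, `grad_fd`, `grad_closed` are purely illustrative): it compares finite-difference partials $\frac{df}{dX_{i,j}}$ against the closed form $aa^T$.

```python
import numpy as np

# Finite-difference check of df/dX_{i,j} = a_i a_j for f(X) = a^T X a.
# Names here are illustrative only.
rng = np.random.default_rng(0)
n = 5
a = rng.standard_normal((n, 1))   # column vector, as in the question
X = rng.standard_normal((n, n))

def f(M):
    # a^T M a is a 1x1 array; .item() extracts the scalar value
    return (a.T @ M @ a).item()

eps = 1e-6
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps             # perturb the single entry X_{i,j}
        grad_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

grad_closed = a @ a.T             # the Matrix Cookbook result a a^T
print(np.allclose(grad_fd, grad_closed, atol=1e-6))   # expected: True
```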

Jean-Claude Arbaut
  • I have some basic linear algebra questions. First, if $a$ has dimensions 5x1 and $X$ has dimensions 5x5, is the result of $a^{T}Xa$ a scalar? Secondly, why can I write the first expression as $\sum_{i,j}X_{i,j}a_{i}a_{j}$? Finally, since there is no swap of the indices $i$ and $j$, I do not understand where the transpose comes from. – Jose Ramon Sep 08 '20 at 07:01
  • @JoseRamon Yes, the result is a scalar. And it is critical to understand that well in order to know what is happening! – mathcounterexamples.net Sep 08 '20 at 07:09
  • I guess this is because it is a scalar ($\sum_{i,j}X_{i,j}a_{i}a_{j}$), right? But then I do not see the transpose. – Jose Ramon Sep 08 '20 at 07:11
  • Yes, I got the scalar part, so in the example the partial derivative is $\frac{\partial f(\mathbf{X})}{\partial X_{ij}} = a_{i}a_{j}$. Then I assemble my result from all these partial derivatives, and the final matrix has size NxN. But why the transpose? – Jose Ramon Sep 08 '20 at 07:14
  • @JoseRamon It's an outer product: the $(i,j)$ element of the matrix is the product $a_ia_j$. It's exactly the same as $aa^T$ (check for yourself; the matrix product $aa^T$ is trivial here). – Jean-Claude Arbaut Sep 08 '20 at 07:21
  • @Jean-ClaudeArbaut yes I think I am getting closer :) – Jose Ramon Sep 08 '20 at 07:24
  • @JoseRamon Regarding your first question: $b^TXa=\sum_i b_i (Xa)_i=\sum_i b_i\left(\sum_j X_{i,j}a_j\right)=\sum_{i,j} X_{i,j}b_ia_j$ (where $X$ is a matrix and $a,b$ are vectors, with compatible dimensions). Note that $Xa$ is a vector, and for two vectors $u,v$, $u^Tv$ is their scalar product. – Jean-Claude Arbaut Sep 08 '20 at 07:31
  • In this case it is not a scalar, right? It is the outer product of $b$ and $a$. – Jose Ramon Sep 08 '20 at 07:33
  • The product $aa^T$ is not a scalar, but a matrix. However, the product $a^Ta$ is a scalar. Just write out the matrix product, and consider vectors to be column vectors (a matrix with one column). And in $b^TXa$, you have the scalar product of $b$ and $Xa$, which are both vectors. Hence the function you differentiate is a scalar function of the matrix $X$. – Jean-Claude Arbaut Sep 08 '20 at 07:34
  • $(df/dX)(X) = a \cdot a^T$ is a linear form that associates a scalar to a matrix $u$. How do you obtain the scalar knowing the matrices $a \cdot a^T$ and $u$? – mathcounterexamples.net Sep 08 '20 at 07:39
  • @mathcounterexamples.net See https://math.stackexchange.com/questions/2807864/derivative-of-the-trace-of-the-product-of-a-matrix-and-its-transpose/2809102#2809102, where I wrote out the detailed derivation. $df/dX$ is indeed a linear form, but it's written in compact form as a matrix, by convention: instead of writing a vector with $np$ entries, it's more compact to write an $n\times p$ matrix. See also the Wikipedia link above and the layout convention part (there are two competing conventions). – Jean-Claude Arbaut Sep 08 '20 at 07:44
  • @Jean-ClaudeArbaut Thanks, I understand now! I don't know what you think, but it seems very complex to use such results. Moreover, in terms of practical use, do you know whether those conventions are used in the usual linear programming packages? – mathcounterexamples.net Sep 08 '20 at 07:55
2

$$\begin{array}{l|rcl} f : & M_n(\mathbb R) & \longrightarrow & \mathbb R\\ & X & \longmapsto & a^T X a \end{array}$$

is a linear map.

It is critical to understand what the domain and codomain of $f$ are in order to understand what $f$ is as a function.

Hence its Fréchet derivative at each point is equal to $f$ itself: $f^\prime(X).u = a^T u a$.

Following a detailed and interesting discussion with Jean-Claude Arbaut (see the comments!), we can rewrite

$$f^\prime(X).u =a^T u a = \mathrm{tr}(a^T u a) = \mathrm{tr}(u \cdot (a \cdot a^T))= \mathrm{tr}((a \cdot a^T) \cdot u) = \mathrm{tr}(A \cdot u)$$

where $A = a \cdot a^T$ is defined as the matrix calculus derivative of $f$ with respect to $X$. This is in fact what is meant by

$$\nabla_{\mathbf{X}} \left( \mathbf{a}^{T} \mathbf{X} \mathbf{a} \right) = \frac{\partial\left( \mathbf{a}^{T} \mathbf{X} \mathbf{a} \right)}{\partial \mathbf{X}}=\mathbf{a}\cdot\mathbf{a}^{T}$$ in the Matrix Cookbook.
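
As a quick numerical illustration of the trace pairing above, one can check that the directional derivative $a^T u a$, the trace form $\mathrm{tr}((aa^T)u)$, and a difference quotient of $f$ all agree for a random direction $u$. The following NumPy sketch (variable names purely illustrative) does exactly that:

```python
import numpy as np

# Numerical illustration: for a random direction u, the directional
# derivative of f(X) = a^T X a equals a^T u a, which is the same
# scalar as tr((a a^T) u), i.e. tr(A u) with A = a a^T.
rng = np.random.default_rng(1)
n = 4
a = rng.standard_normal((n, 1))
X = rng.standard_normal((n, n))
u = rng.standard_normal((n, n))

f = lambda M: (a.T @ M @ a).item()

directional = (a.T @ u @ a).item()        # f'(X).u (f is linear in X)
trace_form = np.trace((a @ a.T) @ u)      # tr(A u) with A = a a^T
t = 1e-7
finite_diff = (f(X + t * u) - f(X)) / t   # (f(X + t u) - f(X)) / t

print(np.isclose(directional, trace_form))               # expected: True
print(np.isclose(directional, finite_diff, atol=1e-5))   # expected: True
```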