1

I am reading Duchi's notes$^\color{red}{\star}$ and trying to understand why

$$\nabla_A (A B) = B^\top, \qquad \nabla_A \mbox{tr} (A B) = B^\top$$

and why they are the same. Can someone please explain how to derive the gradient of a matrix product and what the appropriate dimensions of this gradient are?

The trace is a scalar, so it makes sense to me that its gradient has the dimensions of $B^\top$, since those are the same as the dimensions of $A$. But I can't seem to understand how to get the gradient of a product of matrices, or what its dimensions should be.


$\color{red}{\star}$ John Duchi, Properties of the Trace and Matrix Derivatives

2 Answers

3

Suppose $f(A)=\operatorname{tr} (AB)$, then $f(A+H)-f(A) = \operatorname{tr} (HB)$, so we have $Df(A)(H) = \operatorname{tr} (HB)$. (Not surprisingly, since trace is linear.)

In a Hilbert space, the gradient of a functional is an element $\nabla f(A)$ such that $Df(A)(H) = \langle \nabla f(A), H \rangle$ for all $H$.

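Since the inner product on matrices is $\langle X, Y \rangle = \operatorname{tr} (X^T Y)$ and $\operatorname{tr} (HB) = \operatorname{tr} (BH) = \operatorname{tr} ((B^T)^T H) = \langle B^T, H \rangle$, we see that $\nabla f(A) = B^T$.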
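
For readers who prefer components, the same result can be read off entry by entry. This is only a sketch, assuming $A$ is $m \times n$ and $B$ is $n \times m$ (sizes the question does not state):

$$\operatorname{tr}(AB) = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji} \quad\implies\quad \frac{\partial \operatorname{tr}(AB)}{\partial A_{ij}} = B_{ji} = (B^T)_{ij}$$

So the gradient is an $m \times n$ array, the same shape as $A$, and its $(i,j)$ entry is $(B^T)_{ij}$.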

This is entirely analogous to a function $g : \mathbb{R}^n \to \mathbb{R}$. The derivative is usually written as a row vector while the gradient is a column vector.
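
As a sanity check of $\nabla_A \operatorname{tr}(AB) = B^T$, here is a minimal pure-Python sketch (not taken from Duchi's notes) that compares a central-difference estimate of the entrywise partial derivatives with $B^T$. The helper names (`matmul`, `trace`, `transpose`, `numerical_gradient`), the matrix sizes, and the random test data are illustrative assumptions.

```python
import random


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def trace(X):
    """Sum of the diagonal entries of a square matrix."""
    return sum(X[i][i] for i in range(len(X)))


def transpose(X):
    return [list(col) for col in zip(*X)]


def numerical_gradient(f, A, eps=1e-6):
    """Central-difference estimate of d f / d A[i][j]; same shape as A."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            old = A[i][j]
            A[i][j] = old + eps
            f_plus = f(A)
            A[i][j] = old - eps
            f_minus = f(A)
            A[i][j] = old  # restore the perturbed entry
            grad[i][j] = (f_plus - f_minus) / (2 * eps)
    return grad


random.seed(0)
m, n = 2, 3  # assumed sizes: A is m x n, B is n x m
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
B = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

grad = numerical_gradient(lambda X: trace(matmul(X, B)), A)
Bt = transpose(B)

# Largest entrywise gap between the numerical gradient and B^T.
max_err = max(abs(grad[i][j] - Bt[i][j]) for i in range(m) for j in range(n))
print("max |grad - B^T| =", max_err)
```

Because $\operatorname{tr}(AB)$ is linear in $A$, the finite difference carries no truncation error, so the reported gap should be at the level of floating-point rounding.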

Addendum:

Let $f(A) = \operatorname{tr} (A B A^T C)$. Then we have $f(A+H)-f(A) = \operatorname{tr} (H B A^T C)+\operatorname{tr} (A B H^T C)+\operatorname{tr} (H B H^T C)$. The last term is of order $O(\|H\|^2)$, so we see that $Df(A)(H) = \operatorname{tr} (H B A^T C)+\operatorname{tr} (A B H^T C) $.

The relevant properties of the trace are (i) transpose invariance, $\operatorname{tr} X = \operatorname{tr} X^T$, and (ii) cyclic (shift) invariance, $\operatorname{tr} (X_1 \cdots X_n) = \operatorname{tr} (X_2 \cdots X_n X_1)$.

Applying these gives \begin{eqnarray} Df(A)(H) &=& \operatorname{tr} ((C^T A B^T)^T H)+\operatorname{tr} ((CAB)^TH) \\ &=& \langle C^T A B^T + CAB, H \rangle \end{eqnarray} from which we get the gradient to be $\nabla f(A) = C^T A B^T + CAB$.
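
The defining relation $Df(A)(H) = \langle \nabla f(A), H \rangle$ can also be checked numerically for this $f$. The following pure-Python sketch compares a central-difference directional derivative along a random $H$ with $\langle C^T A B^T + CAB, H \rangle$. The helper names, the sizes ($A$ is $m \times n$, $B$ is $n \times n$, $C$ is $m \times m$), and the random data are assumptions made for illustration.

```python
import random


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def axpy(X, Y, s=1.0):
    """Entrywise X + s * Y."""
    return [[x + s * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(1)
m, n = 2, 3  # assumed sizes
A, H = rand(m, n), rand(m, n)
B, C = rand(n, n), rand(m, m)


def f(X):
    """f(X) = tr(X B X^T C)."""
    return trace(matmul(matmul(matmul(X, B), transpose(X)), C))


# Claimed gradient: C^T A B^T + C A B (same shape as A).
G = axpy(matmul(matmul(transpose(C), A), transpose(B)),
         matmul(matmul(C, A), B))

t = 1e-5
numeric = (f(axpy(A, H, t)) - f(axpy(A, H, -t))) / (2 * t)  # Df(A)(H)
analytic = sum(G[i][j] * H[i][j] for i in range(m) for j in range(n))  # <G, H>
gap = abs(numeric - analytic)
print("numeric:", numeric, "analytic:", analytic)
```

Since $f$ is quadratic in $A$, the central difference is exact apart from rounding, so the two printed numbers should agree to many digits.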

copper.hat
  • What does the notation $Df(A)(H)$ mean? – user179156 Jul 21 '18 at 20:17
  • The derivative of $f$ evaluated at $A$ in the direction $H$. Sometimes written as ${\partial f(A) \over \partial x} (H)$. – copper.hat Jul 21 '18 at 20:21
  • Thanks. But a matrix product is not an ordinary function, right? How do I derive the gradient of such a function (a matrix product), and what is the derivative (in denominator layout)? – user179156 Jul 21 '18 at 20:29
  • I just answered that above? I am using the term ordinary in a colloquial sense. It is of course an ordinary function. – copper.hat Jul 21 '18 at 20:33
  • sorry for the confusion. I don't have sufficient math background to understand relation to hilbert space and how it relates. Would you mind explaining the derivation of gradient of product of matrix in a format similar to the link : https://web.stanford.edu/~jduchi/projects/matrix_prop.pdf , where they derive the gradient for the trace of AB . – user179156 Jul 21 '18 at 20:44
  • Also it seems in your solution you derived gradient of trace(AB) , how do i relate that gradient of AB ? I understand from the usual index notation taking gradient of a trace function wrt different elements of A will give me B transpose. But using the same element wise multiplication, i don't know how to get gradient of AB and what the dimension of gradient should be. – user179156 Jul 21 '18 at 20:52
  • I have no idea what the gradient of $AB$ means. The derivative of $A \mapsto AB$ is straightforward to compute, it is just $H \mapsto HB$. But I don't know what is meant by a gradient of a non-scalar function. – copper.hat Jul 21 '18 at 20:54
  • I see, so maybe you could please explain, for the derivation in the link I posted (the funky trace derivative), how the component $C^TAB^T$ is derived. In that derivation I understand where $C^TA$ comes from, but the gradient operator applied to $f(A)$ turns it into $B^T$ (where $f(A)$ is defined as $AB$). That is the part I don't understand. I also tried another link for the same derivation, http://users.ece.cmu.edu/~asaluja/lms.pdf, section 4.3, where a note says: 7 (note that $f'(A) = \nabla_A AB = B^T$). What does this even mean? Because, as you said, it is not a scalar function. – user179156 Jul 21 '18 at 21:13
  • It is the derivative of a trace which is real valued? I will add an additional part to my answer in a few mins. – copper.hat Jul 21 '18 at 21:19
  • Frankly I find working at the matrix element level for such things to be troublesome, error prone & unintuitive. – copper.hat Jul 21 '18 at 21:35
  • Thanks for the edit , but it still doesn't give the proof I am looking for of what gradient of matrix valued function is. f(A) = AB , then gradient of f wrt A is B transpose ? why is that. Your proof is elegant but somehow works around the thing i am looking an answer for. – user179156 Jul 21 '18 at 22:11
  • Like I wrote above, I have no idea what the gradient of a non scalar function means. The derivative is immediate, but without knowing what is meant by gradient I cannot answer. – copper.hat Jul 21 '18 at 22:16
2

The gradient of a matrix-valued function with respect to a matrix is a 4th-order tensor.

It can be calculated from the differential

$$\eqalign{ C &= AB \cr dC &= dA\,B = {\mathcal H}B^T:dA \cr \frac{\partial C}{\partial A} &= {\mathcal H}B^T \cr }$$

where ${\mathcal H}$ is a 4th order isotropic tensor whose components can be expressed in terms of Kronecker deltas

$$\eqalign{ {\mathcal H}_{ijkl} &= \delta_{ik}\,\delta_{jl} \cr }$$

The colon is used to represent the double-contraction product, while juxtaposition represents a single-contraction product. In terms of components

$$\eqalign{ M &= {\mathcal H}:X &\implies M_{ij} = {\mathcal H}_{ijkl}\,X_{kl} \cr {\mathcal P} &= {\mathcal H} X &\implies {\mathcal P}_{ijkm} = {\mathcal H}_{ijkl}\,X_{lm} \cr }$$

The trace is just a double-contraction with the identity matrix, i.e.

$${\rm tr}(X) = I:X$$

Therefore

$$\eqalign{ {\rm tr}\bigg(\frac{\partial C}{\partial A}\bigg) &= \frac{\partial\,{\rm tr}(C)}{\partial A} = I:{\mathcal H}B^T = B^T \cr }$$
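
To make the index bookkeeping concrete, here is a small pure-Python sketch that builds the 4th-order array with components $\partial C_{ij} / \partial A_{kl} = \delta_{ik} B_{lj}$ (the components of ${\mathcal H}B^T$), checks that its double contraction with a perturbation $dA$ reproduces $dA\,B$, and checks that contracting the output indices with the identity gives $B^T$. The variable names, the sizes (with $C = AB$ taken square so that the trace exists), and the random data are assumptions for illustration.

```python
import random

random.seed(2)
m, n = 2, 3  # assumed sizes: A is m x n, B is n x m, so C = A B is m x m
B = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
dA = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]


def delta(a, b):
    """Kronecker delta."""
    return 1.0 if a == b else 0.0


# J[i][j][k][l] = d C_ij / d A_kl = delta_ik * B_lj   (shape m x m x m x n)
J = [[[[delta(i, k) * B[l][j] for l in range(n)] for k in range(m)]
      for j in range(m)] for i in range(m)]

# Double contraction J : dA over (k, l) should reproduce dC = dA B.
dC_tensor = [[sum(J[i][j][k][l] * dA[k][l]
                  for k in range(m) for l in range(n))
              for j in range(m)] for i in range(m)]
dC_direct = [[sum(dA[i][l] * B[l][j] for l in range(n))
              for j in range(m)] for i in range(m)]
err_dC = max(abs(dC_tensor[i][j] - dC_direct[i][j])
             for i in range(m) for j in range(m))

# Contracting the output indices (i, j) with the identity gives d tr(C) / dA.
grad_tr = [[sum(delta(i, j) * J[i][j][k][l]
                for i in range(m) for j in range(m))
            for l in range(n)] for k in range(m)]
err_tr = max(abs(grad_tr[k][l] - B[l][k])  # B[l][k] is (B^T)[k][l]
             for k in range(m) for l in range(n))

print("J : dA vs dA B ->", err_dC)
print("I : J vs B^T   ->", err_tr)
```

The array `J` has four indices, while `grad_tr` has the same two-index shape as $A$. That is the distinction between the derivative of $AB$ and the gradient of $\operatorname{tr}(AB)$.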

frank
  • Still doesn't answer or may be i completely missed : f(A) = AB , then gradient of f wrt A is B transpose ? B transpose doesn't seem to be a 4th order tensor , so may be i am interpreting something wrong from the proof i am reading ? – user179156 Jul 21 '18 at 22:13
  • @user179156 The gradient of the matrix product $(AB)$ is not mentioned at any point in the linked PDF, only the gradient of the scalar-valued function ${\rm tr}(AB)$. However, your question is about the gradient of the former, which is definitely not the matrix $B^T$. – greg Jul 24 '18 at 22:51
  • @greg: yes, I was hoping that what you said is correct and that it probably is the gradient of trace(AB) rather than AB. But I am still not clear, in the funky trace derivative, why it is a trace derivative. According to the chain rule, shouldn't it be the product AB? If you look at this doc (page 5): http://users.ece.cmu.edu/~asaluja/lms.pdf , they clearly mention it is not trace(AB) but AB. Do you mind explaining why it should be a trace derivative when we apply the gradient operator (with respect to A) to the big trace function trace(f(A)ATC)? – user179156 Jul 25 '18 at 16:05
  • @user179156 Regarding page 5 of the PDF: The final result is correct, but lines 16,17,18 are an incoherent mess. Their use of the notation $f(A)$ is misleading as well. Conventionally, $f(A)$ and $A$ commute -- but not with their goofy definition. – greg Jul 25 '18 at 19:54
  • @greg : thanks for confirming , do you mind giving me a correct derivation using similar concept/notations as in notes or what is the correction for the mistake/incoherency. I don't have sufficient math background to clearly understand the above solution , or any reference book for matrix calculus would help. – user179156 Jul 26 '18 at 18:31