
I am having problems calculating the derivative of a function.

Let $C:\mathbb{R}^{n \times n} \longrightarrow \mathbb{R}^{n}$ with $C(M) = (I - M)^{-1}(I + M)x_0$ for $(I - M)$ invertible ($x_0 \in \mathbb{R}^n$, $I$ the identity) and $J:\mathbb{R}^{n} \longrightarrow \mathbb{R}^{n \times n}$ with $J(x)^T = -J(x)$ for all $x \in \mathbb{R}^n$ be two mappings.

What is the derivative of the function $f(x) = C(J(x)) = (I - J(x))^{-1}(I + J(x))x_0$ with respect to $x$?

How does the chain rule apply here? I would say that the derivative has the form $D_xf(x) = 2(I - J(x))^{-1} D_x J(x) (I + J(x))^{-1}x_0$, where $ D_x J(x)$ is a third-order tensor or something like that. But as you can see, I'm not very familiar with the subject.

Donnie
  • So $ J $ is just some differentiable mapping from $ \mathbb R ^ n $ to $ \mathbb R ^ { n \times n } $, while $ C $ is the mapping given by $ C { ( M ) } = { ( I - M ) } ^ { - 1 } { ( I + M ) } x _ 0 $? where $ I $ is some $ n $-by-$ n $ matrix and $ x _ 0 $ is some $ n $-dimensional vector (interpreted as an $ n $-by-$ 1 $ matrix). (So $ C $ is undefined at those matrices $ M $ such that $ I - M $ is singular, but otherwise $ C $ is also differentiable.) Also, is $ I $ just any $ n $-by-$ n $ matrix, or did you mean to say that it's the identity matrix? – Toby Bartels May 04 '24 at 13:57
  • Just use the chain rule and this https://math.stackexchange.com/questions/1471825/derivative-of-the-inverse-of-a-matrix for each coordinate? That is using reasonable assumptions on $J$. –  May 04 '24 at 13:57
  • Yes, $I$ is the identity; moreover, $J(z)$ is a skew-symmetric operator for all $z$. I have corrected it. The skew symmetry guarantees the existence of the inverse. – Donnie May 04 '24 at 14:16
  • You may find the symbols in the answers below strange or unfamiliar. Put simply, for each $k\in\{1,2,\ldots,n\}$, the $k$-th column of the $n\times n$ Jacobian matrix of $f$ is given by \begin{align} 2\left(I-J(x)\right)^{-1}\,\frac{\partial J(x)}{\partial x_k}\,\left(I-J(x)\right)^{-1}x_0, \end{align} where $\frac{\partial J(x)}{\partial x_k}$ is the $n\times n$ matrix whose $(i,j)$-th element is $\frac{\partial (J(x))_{ij}}{\partial x_k}$, the partial derivative of the $(i,j)$-th element of $J(x)$ with respect to $x_k$. – user1551 May 05 '24 at 06:09

3 Answers


$ \def\G{{\large\Gamma}} \def\s{\star} \def\h{\cdot} \def\o{{\tt1}} \def\BR#1{\Big[#1\Big]} \def\LR#1{\left(#1\right)} \def\q{\quad} \def\qq{\qquad} \def\qif{\q\iff\q} \def\qiq{\q\implies\q} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\red#1{\color{red}{#1}} \def\CLR#1{\red{\LR{#1}}} $For typing convenience, let $$\eqalign{ J &= J(x) \\ f &= \LR{I-J}^{-1}\LR{I+J}x_0 \\ w &= f+x_0 \\ }$$ The gradient of $J$ wrt the vector $x$ is a third-order tensor $\G$ with components $$\eqalign{ \G_{ijk} = \grad{J_{ij}}{x_k} \qiq dJ = \G\h dx \\ }$$ Substituting this into the differential of $f$ recovers the desired gradient $$\eqalign{ df &= \LR{I-J}^{-1}\,\red{dJ}\,x_0 \;+\; \LR{I-J}^{-1}\,\red{dJ}\,f \\ &= \LR{I-J}^{-1}\,\red{dJ}\,w \\ &= \LR{\LR{I-J}^{-1}\s w}:\red{dJ} \\ &= \LR{\LR{I-J}^{-1}\s w}:\CLR{\G\h dx} \\ \grad fx &= \LR{\LR{I-J}^{-1}\s w}:\G \\ }$$ where $(\s)$ is the dyadic/tensor product and $(:)$ is the double-contraction product.
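
As a quick numerical sanity check of this gradient formula, here is a minimal NumPy sketch (the particular skew-symmetric $J(x)$, the dimension $n=4$, and the step size $h$ are arbitrary choices made only for illustration): it builds the third-order tensor $\Gamma$ by central finite differences, forms the double contraction $\big((I-J)^{-1}\star w\big):\Gamma$, and compares it with a finite-difference Jacobian of $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x0 = rng.standard_normal(n)

def J(x):
    # a made-up smooth map R^n -> skew-symmetric n x n matrices
    A = np.sin(np.outer(x, x) + x[None, :])
    return A - A.T

def f(x):
    M = J(x)
    return np.linalg.solve(np.eye(n) - M, (np.eye(n) + M) @ x0)

x = rng.standard_normal(n)
h = 1e-6

# Gamma[i, j, k] = dJ_ij / dx_k, built slice by slice with central differences
Gamma = np.empty((n, n, n))
for k in range(n):
    e = np.zeros(n)
    e[k] = h
    Gamma[:, :, k] = (J(x + e) - J(x - e)) / (2 * h)

K = np.linalg.inv(np.eye(n) - J(x))   # (I - J(x))^{-1}
w = f(x) + x0

# gradient from the formula  ((I - J)^{-1} * w) : Gamma,
# i.e. grad[i, k] = sum_{j, l} K[i, j] * w[l] * Gamma[j, l, k]
grad_formula = np.einsum('ij,l,jlk->ik', K, w, Gamma)

# reference: finite-difference Jacobian of f
grad_fd = np.empty((n, n))
for k in range(n):
    e = np.zeros(n)
    e[k] = h
    grad_fd[:, k] = (f(x + e) - f(x - e)) / (2 * h)

print(np.max(np.abs(grad_formula - grad_fd)))   # should be tiny
```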

greg
  • This seems like awfully complicated notation. Isn't the result just $\delta_{x_i}f(x)=(I-J(x))^{-1}(\delta_{x_i}J(x))(I-(I-J(x))^{-1}(I+J(x)))x_0$? –  May 04 '24 at 19:41
  • where $\delta_{x_i} J(x)$ is the matrix where each component gets differentiated in the $i$-th variable, i.e. if $J(x)=(j_{m,n}(x))$ then $\delta_{x_i} J(x)=(\delta_{x_i}j_{m,n}(x))$. –  May 04 '24 at 20:03
  • @underflow you are right so far, I just ran it with sympy for different scenarios; just a small correction, I think it is $(I + (I - J(x))^{-1}(I + J(x)))$ in the second part of the derivative. – Donnie May 04 '24 at 21:01
  • @Donnie Ah yes, my bad , sorry :) –  May 04 '24 at 21:07
  • @underflow greg is right to point out that you need to use a tensor product for $(I-J)^{-1}$ and $w$ (using his notation). Somehow you have reduced that part to a matrix product which you then apply to a third order tensor! – Ted Black May 06 '24 at 23:13
  • @TedBlack I didn't say he was wrong or anything, just that you could express it in a more widely understood way. Notation should be there to simplify things. I mean, if you are used to something it will always be straightforward, and in your world it may be the most aesthetic and natural. But one should still consider the audience. I think my notation and reasoning are comprehensible to any first-year student and above. –  May 06 '24 at 23:23
  • @underflow the only problem with your notation is that it looks like a matrix product. As greg pointed out it is a double contraction of two third order tensors (one of which is the tensor product of a matrix and a vector) which can be written as ${Q^a}_{ib}{\Gamma^{bc}}_a$ where ${\Gamma^{bc}}_a=\partial {J^b}_a / \partial x_c$ and ${Q^a}_{ib}={((I-M)^{-1})^a}_i w_b$. – Ted Black May 06 '24 at 23:47
  • @TedBlack It is a matrix product. I am just taking the derivative in one coordinate at a time. The idea is to treat it as a function $\mathbb{R}\to\mathbb{R}^n$, which eases notation. –  May 06 '24 at 23:52

The derivative of a map from $ \mathbb R ^ m $ to $ \mathbb R ^ n $ is an $ n $-by-$ m $ matrix, but if the map goes between spaces of matrices, then yes, you need higher-rank tensors. It can be helpful to work entry by entry, or to use (which looks more or less the same) abstract index notation. But we can also stick to matrices if we work with differentials instead of derivatives, since this doesn't increase the rank. (Formally, you can think of the differential as the partial derivative with respect to some scalar component of $ x $ without specifying which one, although there are other ways to think of it.) Also to avoid having to use the rule for differentiating an inverse matrix (which as noted in the answer that @underflow linked to in a comment, is $ \mathrm d ( K ^ { - 1 } ) = - K ^ { - 1 } \, \mathrm d K \, K ^ { - 1 } $ instead of anything involving $ K ^ { - 2 } $), I'll rewrite your formula for $ f $ as the equation $ \big ( I - J ( x ) \big ) \, f ( x ) = \big ( I + J ( x ) \big ) \, x _ 0 $.

So $$ \eqalign { \mathrm d \Big ( \big ( I - J ( x ) \big ) \, f ( x ) \Big ) & = \mathrm d \Big ( \big ( I + J ( x ) \big ) \, x _ 0 \Big ) \\ \mathrm d \big ( I - J ( x ) \big ) \, f ( x ) + \big ( I - J ( x ) \big ) \, \mathrm d \big ( f ( x ) \big ) & = \mathrm d \big ( I + J ( x ) \big ) \, x _ 0 + \big ( I + J ( x ) \big ) \, \mathrm d ( x _ 0 ) \\ \Big ( \mathrm d ( I ) - \mathrm d \big ( J ( x ) \big ) \Big ) \, f ( x ) + \big ( I - J ( x ) \big ) \, \mathrm d \big ( f ( x ) \big ) & = \Big ( \mathrm d ( I ) + \mathrm d \big ( J ( x ) \big ) \Big ) \, x _ 0 + \big ( I + J ( x ) \big ) \, 0 \\ \Big ( 0 - \mathrm d \big ( J ( x ) \big ) \Big ) \, f ( x ) + \big ( I - J ( x ) \big ) \, \mathrm d \big ( f ( x ) \big ) & = \Big ( 0 + \mathrm d \big ( J ( x ) \big ) \Big ) \, x _ 0 \\ \big ( I - J ( x ) \big ) \, \mathrm d \big ( f ( x ) \big ) & = \mathrm d \big ( J ( x ) \big ) \, f ( x ) + \mathrm d \big ( J ( x ) \big ) \, x _ 0 \\ \mathrm d \big ( f ( x ) \big ) & = \big ( I - J ( x ) \big ) ^ { - 1 } \, \mathrm d \big ( J ( x ) \big ) \, \big ( { f ( x ) + x _ 0 } \big ) \text . } $$ You could expand $ f ( x ) $ on the right-hand side into its definition, but it's easier to read if you don't.

If you want the partial derivative of $ f ( x ) $ with respect to the $ i $th component of $ x $, then replace $ \mathrm d $ above with $ \partial _ i $, which is short for $ \partial / \partial x _ i $ (or $ \partial / \partial x ^ i $ to distinguish upper and lower indices). If we write $ J ^ i _ j $ for the entry in row $ i $ and column $ j $ of the matrix $ J ( x ) $, $ K ^ i _ j $ for the corresponding entry of $ \big ( I - J ( x ) \big ) ^ { - 1 } $, $ f ^ i $ for the $ i $-th entry of $ f ( x ) $, and $ { x _ 0 } ^ i $ for the $ i $-th entry of $ x _ 0 $, then we get $$ \partial _ i f ^ j = \sum _ { k = 1 } ^ n \, \sum _ { l = 1 } ^ n \, K ^ j _ k \, \partial _ i J ^ k _ l \, ( f ^ l + { x _ 0 } ^ l ) \text , $$ which you can abbreviate as $ \partial _ i f ^ j = K ^ j _ k \, \partial _ i J ^ k _ l \, ( f ^ l + { x _ 0 } ^ l ) $ using the Einstein summation convention; or you can interpret this expression as abstract index notation. Note that $ \partial _ i J ^ k _ l $ is (a component of) a rank-$ 3 $ tensor as you suspected, with contravariant rank $ 1 $ and covariant rank $ 2 $. If you write things like this, then there's no direct indication what $ K $ means, so you have to keep track of that $ ( \delta ^ i _ j - J ^ i _ j ) \, K ^ j _ k = K ^ i _ j \, ( \delta ^ j _ k - J ^ j _ k ) = \delta ^ i _ k $, where $ \delta $ is the Kronecker delta (the components of the identity matrix, or the identity matrix itself in abstract index notation). Similarly, $ f ^ i = K ^ i _ j \, ( \delta ^ j _ k + J ^ j _ k ) \, { x _ 0 } ^ k $. You could also start with this and do the whole derivation in this notation.

Now, I noticed something while checking my work, that I probably wouldn't have thought of otherwise, which is that we can do something with that expression $ f ( x ) + x _ 0 $ (or $ f ^ l + { x _ 0 } ^ l $) that appears in the answer. If you expand $ f ( x ) $ as $ \big ( I - J ( x ) \big ) ^ { - 1 } \, \big ( I + J ( x ) \big ) \, x _ 0 $ and also write $ x _ 0 $ as $ \big ( I - J ( x ) \big ) ^ { - 1 } \, \big ( I - J ( x ) \big ) \, x _ 0 $ (this is the non-obvious part), then $ f ( x ) + x _ 0 $ factors as $ \big ( I - J ( x ) \big ) ^ { - 1 } \, \Big ( \big ( I + J ( x ) \big ) + \big ( I - J ( x ) \big ) \Big ) \, x _ 0 $, which simplifies to $ 2 \, \big ( I - J ( x ) \big ) ^ { - 1 } \, x _ 0 $. So we get $$ \mathrm d \big ( f ( x ) \big ) = 2 \, \big ( I - J ( x ) \big ) ^ { - 1 } \, \mathrm d \big ( J ( x ) \big ) \, \big ( I - J ( x ) \big ) ^ { - 1 } \, x _ 0 \text , $$ or $$ \partial _ i f ^ j = 2 \, K ^ j _ k \, \partial _ i J ^ k _ l \, K ^ l _ m \, { x _ 0 } ^ m \text . $$ This might be nicer to work with. (And it looks a lot more like your guess, although I don't know how you got that.)
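
These index expressions translate almost verbatim into `np.einsum`. Below is a minimal sketch under illustrative assumptions (a made-up smooth skew-symmetric $J(x)$ whose derivative tensor $\Gamma$, with $\Gamma[k,l,i] = \partial_i J^k_l$, is worked out by hand): it evaluates both $K^j_k \, \partial_i J^k_l \, (f^l + {x_0}^l)$ and the simplified $2 \, K^j_k \, \partial_i J^k_l \, K^l_m \, {x_0}^m$, and checks them against a finite-difference Jacobian of $f$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
I = np.eye(n)
x0 = rng.standard_normal(n)

def J(x):
    A = np.outer(x, x**2)    # A_kl = x_k * x_l^2, an arbitrary smooth choice
    return A - A.T           # skew-symmetrise

def dJ(x):
    # Gamma[k, l, i] = dJ_kl / dx_i, computed by hand for the J above
    G = np.zeros((n, n, n))
    for i in range(n):
        dA = np.zeros((n, n))
        dA[i, :] += x**2            # d(x_k x_l^2)/dx_i, contribution from k = i
        dA[:, i] += 2 * x * x[i]    # contribution from l = i
        G[:, :, i] = dA - dA.T
    return G

def f(x):
    M = J(x)
    return np.linalg.solve(I - M, (I + M) @ x0)

x = rng.standard_normal(n)
K = np.linalg.inv(I - J(x))          # K = (I - J(x))^{-1}
Gamma = dJ(x)

# d_i f^j = K^j_k (d_i J^k_l) (f^l + x0^l)
D1 = np.einsum('jk,kli,l->ji', K, Gamma, f(x) + x0)

# d_i f^j = 2 K^j_k (d_i J^k_l) K^l_m x0^m   (the simplified form)
D2 = 2 * np.einsum('jk,kli,lm,m->ji', K, Gamma, K, x0)

# finite-difference Jacobian of f as an independent reference
h = 1e-6
Dfd = np.column_stack([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in I])

print(np.max(np.abs(D1 - D2)), np.max(np.abs(D1 - Dfd)))   # both should be tiny
```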

Toby Bartels
  • Ah, I forgot to simplify $(I+(I-J(x))^{-1}(I+J(x)))=2(I-J(x))^{-1}$. I like your notation! –  May 05 '24 at 03:50
  • Thanks for the answer. As for how I got the idea for my solution: I have read the book Geometric Numerical Integration by Hairer, in which the derivation of the Cayley transformation for a matrix is presented; this is also treated in link, where the corresponding lemma is given. – Donnie May 05 '24 at 08:33

For completeness, I write up the solution sketched in the comments.

$$\delta_{x_i}f(x)=\delta_{x_i}\left((I-J(x))^{-1}(I+J(x))x_0\right)=$$ $$\big(\delta_{x_i}(I-J(x))^{-1}\big)(I+J(x))x_0+(I-J(x))^{-1}\big(\delta_{x_i}(I+J(x))\big)x_0=$$ $$(I-J(x))^{-1}(\delta_{x_i}J(x))(I-J(x))^{-1}(I+J(x))x_0+(I-J(x))^{-1}(\delta_{x_i}J(x))x_0=$$ $$(I-J(x))^{-1}(\delta_{x_i}J(x))\big((I-J(x))^{-1}(I+J(x))+I\big)x_0=$$ $$2(I-J(x))^{-1}(\delta_{x_i}J(x))(I-J(x))^{-1}x_0$$ Differentiating the matrix products entrywise is justified: if we have matrix functions $A(t)=(a_{l,j}(t))$ and $B(t)=(b_{l,j}(t))$ for $t\in\mathbb{R}$, then each entry of $A(t)B(t)$ is a sum of terms of the form $a_{l,j}(t)b_{m,n}(t)$, so the ordinary product rule applies entrywise; an analogous argument gives $\delta_{x_i}(A(x)x_0)=(\delta_{x_i}A(x))x_0$. This is what is used in the second equality. Note that if $J(x)=(j_{m,n}(x))$ then $\delta_{x_i}J(x)=(\delta_{x_i}j_{m,n}(x))$.
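
To reproduce the kind of sympy check mentioned in the comments, here is a small symbolic sketch (the concrete skew-symmetric $J(x)$ and the vector $x_0$ are arbitrary illustrative choices) verifying $\delta_{x_i}f(x)=2(I-J(x))^{-1}(\delta_{x_i}J(x))(I-J(x))^{-1}x_0$ for $n=2$:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
I = sp.eye(2)
x0 = sp.Matrix([1, 2])                        # any fixed vector

# a made-up skew-symmetric J(x) for x = (x1, x2)
J = sp.Matrix([[0, x1 * x2],
               [-x1 * x2, 0]])

f = (I - J).inv() * (I + J) * x0              # f(x) = (I - J(x))^{-1} (I + J(x)) x0

for xi in (x1, x2):
    lhs = f.diff(xi)                                          # entrywise derivative
    rhs = 2 * (I - J).inv() * J.diff(xi) * (I - J).inv() * x0
    print((lhs - rhs).applyfunc(sp.simplify))                 # prints the zero vector
```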

  • Thanks for the answer, now I have understood it. I primarily had problems with the dimensions, but this way it is easy to understand. – Donnie May 05 '24 at 08:35