In a subfield of machine learning called "self-supervised learning", many methods constrain network representations to lie on the hypersphere. I want to understand how such representations relate to matrix norms.
Suppose I sample $N$ points $x_1, \ldots, x_N \in \mathbb{R}^D$ uniformly at random from the unit hypersphere in $\mathbb{R}^D$, i.e. $x_n^T x_n = 1$, and I then stack those $N$ points as rows of a matrix $X \in \mathbb{R}^{N \times D}$.
I'm specifically interested in the regime $N > D \gg 1$.
What is the expected nuclear norm of $X$ as a function of $N$ and $D$? And what is the variance of the nuclear norm? The nuclear norm is defined as the sum of the singular values of $X$:
$$\|X\|_* := \sum_{r=1}^{\operatorname{rank}(X)} \sigma_r(X) = \operatorname{Tr}\!\left[\sqrt{X^T X}\right]$$
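For concreteness, here is how one can generate such a matrix and compute its nuclear norm numerically (a sketch using NumPy; the Gaussian-then-normalize trick for uniform sphere sampling and the choice $N = 500$, $D = 50$ are mine, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 50  # arbitrary choice in the regime N > D >> 1

# Uniform samples on the unit sphere in R^D: draw standard Gaussians
# and normalize each row to unit length.
X = rng.standard_normal((N, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Nuclear norm = sum of the singular values of X.
nuclear = np.linalg.svd(X, compute_uv=False).sum()
```

Since each row has unit norm, $\|X\|_F = \sqrt{N}$, and the general bounds $\|X\|_F \le \|X\|_* \le \sqrt{\operatorname{rank}(X)}\,\|X\|_F$ give $\sqrt{N} \le \|X\|_* \le \sqrt{ND}$ deterministically.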
Update: A friend helped me derive an intuitive sketch. In high $D$, a Gaussian vector $\mathcal{N}(0, I_D / D)$ concentrates near the unit sphere, so we can think of uniform samples on the sphere as (approximately) coming from this Gaussian distribution. This is helpful because all the coordinates are now independent. Consider $X^T X$. Each diagonal entry is the sum of $N$ squared coordinates, each with mean $1/D$, so it concentrates around $N/D$. Each off-diagonal entry is the sum of $N$ products of independent zero-mean coordinates, so it concentrates around $0$. Hence $X^T X \approx (N/D) I_{D \times D}$, the matrix square root is $\approx \sqrt{N/D}\, I_{D \times D}$, and its trace is $D \sqrt{N/D} = \sqrt{ND}$.
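A quick Monte Carlo sanity check of this sketch (my own code, not part of the derivation; $N = 500$, $D = 50$, and the trial count are arbitrary). It estimates the mean and standard deviation of $\|X\|_*$ across independent draws and compares the mean to the prediction $\sqrt{ND}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, trials = 500, 50, 20  # arbitrary choices with N > D >> 1

norms = []
for _ in range(trials):
    # Rows uniform on the unit sphere: normalize Gaussian draws.
    X = rng.standard_normal((N, D))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    norms.append(np.linalg.svd(X, compute_uv=False).sum())
norms = np.array(norms)

mean, std = norms.mean(), norms.std()
prediction = np.sqrt(N * D)
```

In my runs the empirical mean sits slightly *below* $\sqrt{ND}$ (a few percent for $D/N = 0.1$), which makes sense: the eigenvalues of $X^T X$ have spread around $N/D$, and the square root is concave, so Jensen's inequality pushes $\operatorname{Tr}\sqrt{X^T X}$ below $\sqrt{ND}$. The standard deviation is tiny relative to the mean, consistent with strong concentration.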