
In a subfield of machine learning called "self-supervised learning", many methods constrain network representations to lie on the hypersphere. I want to understand how such representations relate to matrix norms.

Suppose I sample $N$ points $x_1, \ldots, x_N \in \mathbb{R}^D$ uniformly at random from the unit hypersphere in $\mathbb{R}^D$, i.e. $x_n^T x_n = 1$, and I then stack those $N$ points as rows in a matrix $X \in \mathbb{R}^{N \times D}$.

I'm specifically interested in the regime $N > D \gg 1$.

What is the expected nuclear norm of $X$ as a function of $N$ and $D$? And the variance of the nuclear norm? The nuclear norm is defined as the sum of the singular values of $X$:

$$||X||_* := \sum_{r=1}^{\text{rank}(X)} \sigma_r(X) = \operatorname{Tr}\!\left[\sqrt{X^T X}\right]$$
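As a sanity check on the two equivalent forms of the definition, here is a short NumPy snippet (the variable names are my own) confirming numerically that the sum of singular values equals $\operatorname{Tr}[\sqrt{X^T X}]$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))

# Nuclear norm as the sum of the singular values of X.
nuclear_via_svd = np.linalg.svd(X, compute_uv=False).sum()

# Equivalently, Tr[sqrt(X^T X)]: the matrix square root of the
# symmetric PSD matrix X^T X has trace equal to the sum of the
# square roots of its eigenvalues.
eigvals = np.linalg.eigvalsh(X.T @ X)
nuclear_via_trace = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

assert np.isclose(nuclear_via_svd, nuclear_via_trace)
```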

Update: A friend helped me derive an intuitive sketch. In high $D$, a vector of i.i.d. $\mathcal{N}(0, 1/D)$ coordinates concentrates on the unit sphere, so we can think of uniform samples on the sphere as (approximately) coming from this Gaussian distribution. This is helpful because all the coordinates are now independent. Consider $X^T X$. Each diagonal entry is a sum of $N$ squared coordinates, each with expectation $1/D$, for a total of $\approx N/D$. Each off-diagonal entry is an inner product of two nearly orthogonal columns, hence close to zero. So $X^T X \approx (N/D) I_{D \times D}$, the matrix square root is $\approx \sqrt{N/D}\, I_{D \times D}$, and its trace is $D \sqrt{N/D} = \sqrt{ND}$.
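A quick Monte Carlo check of this sketch (my own code, not part of the derivation): sample uniform points on the sphere by normalizing i.i.d. Gaussian rows, compute the nuclear norm, and compare the empirical mean to $\sqrt{ND}$. The regime and the 5% tolerance below are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 100  # regime of interest: N > D >> 1

def nuclear_norm_of_sphere_sample(rng, N, D):
    # Uniform points on the unit sphere: normalize i.i.d. Gaussian rows.
    X = rng.standard_normal((N, D))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return np.linalg.svd(X, compute_uv=False).sum()

samples = np.array([nuclear_norm_of_sphere_sample(rng, N, D) for _ in range(20)])
prediction = np.sqrt(N * D)
relative_error = abs(samples.mean() / prediction - 1)
```

In this regime the empirical mean lands within a fraction of a percent of $\sqrt{ND}$; the residual gap is the finite-$D/N$ correction the sketch ignores.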

  • It is incumbent on Question authors to include some context for a problem statement. The title asks for an upper bound, but the body asks for a sharp upper bound by calling for "the largest possible nuclear norm". Why the bound is needed and why the "nuclear norm" is important to you would provide suitable context for Readers to respond in a useful way. – hardmath Aug 30 '23 at 16:40
  • I added motivation and, upon reflecting on your feedback, changed the question slightly because I now better understand what I want. – Rylan Schaeffer Sep 02 '23 at 20:40
  • @hardmath I added a sketch of the answer, but I need help making it tight. If you could reopen the question, I would appreciate it! – Rylan Schaeffer Sep 03 '23 at 00:01
  • Help me understand what you intend to resolve. As sketched, matrix $X$ is randomly sampled to have $N$ rows of length $D$, each with unit Euclidean norm. Now $N,D$ don't generally determine the nuclear norm of $X$. Perhaps you have in mind finding the expected value of that nuclear norm? Special cases $N=1$ and $D=1$ are apparently not of interest to you, based on your interest in "self-supervised [machine] learning". – hardmath Sep 03 '23 at 01:33
  • Correct, the regime I'm interested in is N > D >> 1. I'm interested in both the expected value and the variance. Would you like to update the question or should I? – Rylan Schaeffer Sep 03 '23 at 03:13
  • 1
    It seems your update fits your Comment. – hardmath Sep 03 '23 at 03:41
  • Searching for expected value of trace norm (a synonym for the nuclear norm) led me to this paper which provides the limiting distribution of "the trace norm of a random matrix" in Sec. 4.1. I suspect the methods described there will shed light on your Question. – hardmath Sep 04 '23 at 15:07

0 Answers