I have seen several proofs of the Schwartz kernel theorem, using different techniques. Some (such as Melrose's proof in his notes on microlocal analysis) use the representations of $\mathcal{S}(\mathbb{R}^n)$ and $\mathcal{S}'(\mathbb{R}^n)$ in terms of weighted Sobolev spaces, others (such as the proof in Duistermaat and Kolk) use the Fourier transform, and others still (such as the one in Friedlander and Joshi) use Fourier series.
I can follow these proofs, but I feel I don't really understand them, in that I don't understand what fundamental properties of the space of distributions make them work.
I see that there are similarities: for example, the last two approaches use some sort of decomposition of test functions on $X\times Y$ into sums of tensor products of test functions on $X$ and $Y$.
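To make concrete what I mean by such a decomposition, here is my own rough sketch of the Fourier series version. I state it on tori $\mathbb{T}^n\times\mathbb{T}^m$ rather than on open sets $X\times Y$ to avoid cutoffs and periodisation; the notation $c_{k,l}$ and $e_k$ is mine and not taken from any of the references above. For $\phi\in C^\infty(\mathbb{T}^n\times\mathbb{T}^m)$, writing $e_k(x)=e^{ik\cdot x}$,

$$
\phi(x,y)=\sum_{k\in\mathbb{Z}^n}\sum_{l\in\mathbb{Z}^m} c_{k,l}\,(e_k\otimes e_l)(x,y),
\qquad
c_{k,l}=\frac{1}{(2\pi)^{n+m}}\int_{\mathbb{T}^n}\int_{\mathbb{T}^m}\phi(x,y)\,e^{-ik\cdot x}\,e^{-il\cdot y}\,dy\,dx,
$$

where $(e_k\otimes e_l)(x,y)=e_k(x)\,e_l(y)$. Since $\phi$ is smooth, the coefficients decay rapidly, that is, for every $N$ there is a constant $C_N$ with $|c_{k,l}|\le C_N(1+|k|+|l|)^{-N}$, so the series converges in $C^\infty(\mathbb{T}^n\times\mathbb{T}^m)$ and every term is a tensor product of a smooth function of $x$ and a smooth function of $y$.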
I found this remark in an old paper of Ehrenpreis (On the Theory of Kernels of Schwartz, Proceedings of the American Mathematical Society, Vol. 7, No. 4 (Aug., 1956), pp. 713-718):
> Lemma 1 is the only part of the proof of Theorem 1 [the kernel theorem] that uses special properties of the space $\mathcal{D}$ and, in fact, the analog of Theorem 1 [the kernel theorem] holds for (essentially) all function spaces for which an analog of Lemma 1 can be found.
Lemma 1 is the following:
> Let $B$ be a bounded set in $\mathcal{D}(\mathbb{R}^n\times\mathbb{R}^n)$. Then we can find a bounded set $B'\subset\mathcal{D}(\mathbb{R}^n)$ and a $b>0$ so that every $f\in B$ can be written in the form $\sum_i \lambda_ig_i\otimes h_i$ where $\sum_i|\lambda_i|<b$, and $g_i, h_i\in B'$, and where the series converges in $\mathcal{D}(\mathbb{R}^n\times\mathbb{R}^n)$.
The remark would suggest that the key point really is being able to decompose test functions on $X\times Y$ into sums of tensor products of test functions on $X$ and $Y$, but I still don't see why this should be the case.
I have also read that the theory of nuclear spaces proves an abstract kernel theorem, generalising the usual statement for distributions. I assume this means that the theory isolates the fundamental properties that make the kernel theorem work, but I have found no short, self-contained exposition of it, nor one that does not require extensive prerequisites.
So, my questions are:
- How do people who understand the kernel theorem think about its proof?
- What are the fundamental ingredients that make it work?
- I understand why it is important, but why is it so surprising that every continuous linear map $\mathcal{D}\rightarrow\mathcal{D}'$ is given by a kernel?