
Question

Layers: We denote the layer number by the superscript $\ell$: $\ell=0$ for the input layer, $\ell=1$ for the first hidden layer, and $\ell=L$ for the output layer. The number of neurons in layer $\ell$ is denoted by $d^{(\ell)}$. In particular, $d^{(0)}$ is the number of inputs and $d^{(L)}$ is the number of outputs.

Weights: The weights are denoted by $w_{ij}^{(\ell)}$, with $1\leq \ell\leq L,\ 0\leq i\leq d^{(\ell-1)},\ 1\leq j\leq d^{(\ell)}$. Each weight is bounded above by $\hat{w}$, i.e. $w_{ij}^{(\ell)}< \hat{w}$. The weight $w_{ij}^{(\ell)}$ is associated with the edge joining the $i$th neuron in layer $\ell-1$ to the $j$th neuron in layer $\ell$. The weights $w_{0j}^{(\ell)}=b_{j}^{(\ell)}$ are regarded as biases: they are the weights corresponding to the fake input $x_{0}^{(\ell-1)}=1$.

Inputs and outputs: The inputs to the network are denoted by $x_j^{(0)}$, with $1\leq j\leq d^{(0)}$. We denote by $x_j^{(\ell)}$ the output of the $j$th neuron in layer $\ell$. Consequently, $x_j^{(L)}$, with $1\leq j\leq d^{(L)}$, is the network output. The notation $x_0^{(\ell)} = 1$ is reserved for the fake input linked to the biases of the neurons in layer $\ell+1$.

Consider the $j$th neuron in layer $\ell$. In its first half, the neuron collects the information from the previous layer, as a weighted sum, into the signal $$ s_j^{(\ell)}=\sum_{i=0}^{d^{(\ell-1)}} w_{i j}^{(\ell)} x_i^{(\ell-1)}=\sum_{i=1}^{d^{(\ell-1)}} w_{i j}^{(\ell)} x_i^{(\ell-1)}+b_j^{(\ell)} $$ In its second half, the neuron applies a ReLU function to this signal and outputs the value $$ x_j^{(\ell)}=\text{ReLU}\left(\sum_{i=1}^{d^{(\ell-1)}} w_{i j}^{(\ell)} x_i^{(\ell-1)}+b_j^{(\ell)}\right) $$ When $\ell = L$, we have $$ x^{(L)}_{j}=\sum^{d^{(L-1)}}_{i=1} w^{(L)}_{ij}x^{(L-1)}_{i}+b^{(L)}_{j} $$ that is, $x_j^{(L)}=s_j^{(L)}$ (the output layer is linear).

This can be written in the following equivalent matrix form: $$ X^{(\ell)}=\text{ReLU}\left(W^{(\ell)T} X^{(\ell-1)}+B^{(\ell)}\right) $$ $$ X^{(L)}=W^{(L)T}X^{(L-1)}+B^{(L)} $$ where we used the notations $$ X^{(\ell)}=\left(x_1^{(\ell)}, \ldots, x_{d^{(\ell)}}^{(\ell)}\right)^T, \quad W^{(\ell)}=\left(w_{i j}^{(\ell)}\right)_{i, j}, \quad B^{(\ell)}=\left(b_1^{(\ell)}, \ldots, b_{d^{(\ell)}}^{(\ell)}\right)^T $$ I need to compute an upper bound on the gradient of this ReLU network, in terms of the layer widths $d^{(\ell)}$ and the number of layers $L$. Specifically, I need to find a constant $G$ that satisfies $$ \left\| \nabla_{W} f\left( x^{(0)},W \right) \right\|_2 \leq G $$ where $W$ denotes all the parameters of the network (the biases are included, since $w_{0j}^{(\ell)}=b_{j}^{(\ell)}$) and $f = X^{(L)}$.
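For concreteness, here is a minimal NumPy sketch of the forward pass in the matrix form above; the architecture `d = [3, 4, 4, 2]` and the weight bound `w_hat = 0.5` are arbitrary illustrative choices, not part of the question.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x0, Ws, Bs):
    """Compute X^(L) from X^(0) layer by layer.

    Ws[l] has shape (d^(l), d^(l+1)) so that W.T @ x matches the matrix form
    above; Bs[l] has shape (d^(l+1),).  ReLU is applied on every layer except
    the last one, which stays linear.
    """
    x = np.asarray(x0, dtype=float)
    for l, (W, b) in enumerate(zip(Ws, Bs)):
        s = W.T @ x + b                      # s^(l+1) = W^(l+1)T x^(l) + B^(l+1)
        x = s if l == len(Ws) - 1 else relu(s)
    return x

# Example usage with random weights below a (hypothetical) bound w_hat = 0.5.
rng = np.random.default_rng(0)
d = [3, 4, 4, 2]                             # d^(0), ..., d^(L); illustrative sizes
w_hat = 0.5
Ws = [rng.uniform(-w_hat, w_hat, size=(d[l], d[l + 1])) for l in range(len(d) - 1)]
Bs = [rng.uniform(-w_hat, w_hat, size=d[l + 1]) for l in range(len(d) - 1)]
print(forward(rng.normal(size=d[0]), Ws, Bs))
```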

Answer

To find an upper bound on the gradient of the ReLU neural network, we first need the derivative of the ReLU activation function. The ReLU function is defined as:

$$ \text{ReLU}(x) = \begin{cases} x, &\text{if } x \geq 0 \\ 0, &\text{otherwise} \end{cases} $$

Its derivative (taking the value $1$ at $x=0$ by convention) is given by:

$$\frac{\partial \text{ReLU}(x)}{\partial x} = \begin{cases} 1, &\text{if } x \geq 0 \\ 0, &\text{otherwise} \end{cases}$$
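In code, a minimal NumPy sketch of these two functions, using the same $\text{ReLU}'(0)=1$ convention:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z >= 0).astype(float)   # derivative, with ReLU'(0) = 1

print(relu(np.array([-1.0, 0.0, 2.0])))         # [0. 0. 2.]
print(relu_prime(np.array([-1.0, 0.0, 2.0])))   # [0. 1. 1.]
```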

Now, let $f(x,W)$ denote the output of the neural network with weights $W$ and input $x$. Then, using the chain rule, we can compute the gradient of $f$ with respect to $W$ as:

$$\begin{aligned} \nabla_W f(x,W) &= \left(\frac{\partial f}{\partial w_{ij}^{(\ell)}}\right)_{1\leq \ell\leq L,\ 0\leq i\leq d^{(\ell-1)},\ 1\leq j\leq d^{(\ell)}} \\ &= \left(\frac{\partial f}{\partial x_j^{(\ell)}}\, \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}}\, \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}}\right)_{1\leq \ell\leq L,\ 0\leq i\leq d^{(\ell-1)},\ 1\leq j\leq d^{(\ell)}} \end{aligned}$$

where the factor $\frac{\partial f}{\partial x_j^{(\ell)}}$ is the sensitivity of the output to the $j$th activation of layer $\ell$, obtained by propagating back through layers $\ell+1,\dots,L$; at the output layer it reduces to $\frac{\partial f}{\partial x_j^{(L)}}=1$ for $j=1$, since we take the output to be the scalar $f=x_1^{(L)}$ (i.e. $d^{(L)}=1$).

We can further simplify this expression using the matrix form of the neural network:

$$\begin{aligned} \nabla_W f(x,W) &= \left(\frac{\partial f}{\partial x_j^{(\ell)}}\, \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}}\, \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}}\right)_{1\leq \ell\leq L,\ 0\leq i\leq d^{(\ell-1)},\ 1\leq j\leq d^{(\ell)}} \\ &= \left( X^{(\ell-1)} \left(\frac{\partial f}{\partial X^{(\ell)}}\right)^{T} \operatorname{diag}\!\left(\left[\frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}}\right]_{1\leq j\leq d^{(\ell)}}\right) \right)_{1\leq \ell\leq L} \end{aligned}$$

where $\text{diag}(v)$ is a diagonal matrix with diagonal entries given by the vector $v$.
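To make this layer-by-layer computation concrete, here is a standard backpropagation sketch (not taken from the question) that returns the per-layer gradients, together with a finite-difference check of one entry. It assumes NumPy, a scalar output ($d^{(L)}=1$), the $\text{ReLU}'(0)=1$ convention, and a hypothetical architecture `d = [3, 4, 4, 1]`.

```python
import numpy as np

def grads(x0, Ws, Bs):
    """Backpropagation for f = x_1^(L): returns per-layer gradients (dW, dB)
    with dW[l][i, j] = df/dw_ij^(l+1) and dB[l][j] = df/db_j^(l+1)."""
    # Forward pass, storing activations x^(l) and pre-activations s^(l).
    xs, ss = [np.asarray(x0, dtype=float)], []
    for l, (W, b) in enumerate(zip(Ws, Bs)):
        s = W.T @ xs[-1] + b
        ss.append(s)
        xs.append(s if l == len(Ws) - 1 else np.maximum(s, 0.0))
    # Backward pass: delta holds df/ds^(l) for the current layer.
    delta = np.ones(1)                        # output layer is linear, d^(L) = 1
    dW, dB = [None] * len(Ws), [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dW[l] = np.outer(xs[l], delta)        # df/dw_ij^(l+1) = x_i^(l) * delta_j
        dB[l] = delta.copy()                  # bias = weight of the fake input 1
        if l > 0:
            # delta^(l) = ReLU'(s^(l)) * (W^(l+1) delta^(l+1)), with ReLU'(0)=1.
            delta = (Ws[l] @ delta) * (ss[l - 1] >= 0)
    return dW, dB

# Hypothetical setup and a finite-difference check of one weight entry.
rng = np.random.default_rng(1)
d = [3, 4, 4, 1]
Ws = [rng.normal(size=(d[l], d[l + 1])) for l in range(len(d) - 1)]
Bs = [rng.normal(size=d[l + 1]) for l in range(len(d) - 1)]
x0 = rng.normal(size=d[0])

def f(Ws_):
    x = x0
    for l, (W, b) in enumerate(zip(Ws_, Bs)):
        s = W.T @ x + b
        x = s if l == len(Ws_) - 1 else np.maximum(s, 0.0)
    return x[0]

dW, _ = grads(x0, Ws, Bs)
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[1][2, 0] += eps
# The two numbers should agree closely (barring a ReLU kink at this point).
print(dW[1][2, 0], (f(Wp) - f(Ws)) / eps)
```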

To bound the norm of this gradient, we bound the magnitude of each of its elements. Let $G_{ij}^{(\ell)}$ denote the $(i,j,\ell)$-th element of the gradient. Then, we have:

$$|G_{ij}^{(\ell)}| = \left|\frac{\partial f}{\partial x_j^{(\ell)}} \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}}\right|$$

Using the formula for $s_j^{(\ell)}$, which gives $\frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} = x_i^{(\ell-1)}$, and the fact that $\frac{\partial f}{\partial x_j^{(\ell)}} = 1$ for $\ell=L$ and $j=1$, we get:

$$|G_{ij}^{(\ell)}| = \left|\frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} x_i^{(\ell-1)}\right|$$

We know that for the ReLU activation function, $\frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} = \text{ReLU}'(s_j^{(\ell)})$. Therefore, we can bound $|G_{ij}^{(\ell)}|$ as:

$$|G_{ij}^{(\ell)}| \leq \begin{cases} 0, &\text{if } s_j^{(\ell)}< 0 \\ |x_i^{(\ell-1)}|, &\text{if } s_j^{(\ell)}\geq 0 \end{cases}$$

Now, let $N = \sum_{\ell=1}^{L} d^{(\ell)}$ denote the total number of neurons in the network. Then, we can bound the 2-norm of the gradient as:

$$\begin{aligned} \|\nabla_W f(x,W)\|_2 &= \sqrt{\sum_{\ell=1}^{L}\sum_{i=0}^{d^{(\ell-1)}}\sum_{j=1}^{d^{(\ell)}} |G_{ij}^{(\ell)}|^2} \\ &\leq \sqrt{\sum_{\ell=1}^{L}\sum_{i=0}^{d^{(\ell-1)}}\sum_{j=1}^{d^{(\ell)}} |x_i^{(\ell-1)}|^2} \\ &= \sqrt{\sum_{i=1}^{d^{(0)}}\sum_{j=1}^{d^{(1)}} |x_i^{(0)}|^2+\sum_{\ell=2}^{L}\sum_{i=1}^{d^{(\ell-1)}}\sum_{j=1}^{d^{(\ell)}} |x_i^{(\ell-1)}|^2} \\ &= \sqrt{N}\cdot\max\left\{\sqrt{|x_i^{(0)}|^2},\ \sqrt{\sum_{\ell=2}^{L}\sum_{i=1}^{d^{(\ell-1)}}\sum_{j=1}^{d^{(\ell)}} |x_i^{(\ell-1)}|^2}\right\} \\ &= \sqrt{N}\cdot\max\left\{1,\ \sqrt{\sum_{\ell=2}^{L} d^{(\ell-1)}d^{(\ell)}}\right\} \\ &\leq \sqrt{N}\cdot\max\left\{1,\ \sqrt{d^{(1)}d^{(2)}}\right\} \end{aligned}$$

Therefore, we can choose $G = \sqrt{N}\,\max\left\{1,\sqrt{d^{(1)}d^{(2)}}\right\}$ as an upper bound for the 2-norm of the gradient.
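As a quick numerical sanity check of this constant (not a proof), one can compare $\|\nabla_W f\|_2$, estimated by finite differences, against $G$. The sketch below assumes NumPy, a hypothetical architecture `d = [3, 4, 4, 1]`, a weight bound `w_hat = 0.5`, and inputs drawn from $[0,1]$; all of these are illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = [3, 4, 4, 1]                 # d^(0), ..., d^(L); hypothetical sizes
w_hat = 0.5
Ws = [rng.uniform(-w_hat, w_hat, size=(d[l], d[l + 1])) for l in range(len(d) - 1)]
Bs = [rng.uniform(-w_hat, w_hat, size=d[l + 1]) for l in range(len(d) - 1)]
x0 = rng.uniform(0.0, 1.0, size=d[0])

def f(Ws_, Bs_):
    """Forward pass with a linear output layer; returns the scalar x_1^(L)."""
    x = x0
    for l, (W, b) in enumerate(zip(Ws_, Bs_)):
        s = W.T @ x + b
        x = s if l == len(Ws_) - 1 else np.maximum(s, 0.0)
    return x[0]

# Finite-difference gradient over every weight and every bias.
eps, sq_sum = 1e-6, 0.0
f0 = f(Ws, Bs)
for l in range(len(Ws)):
    for idx in np.ndindex(Ws[l].shape):
        Wp = [W.copy() for W in Ws]; Wp[l][idx] += eps
        sq_sum += ((f(Wp, Bs) - f0) / eps) ** 2
    for j in range(Bs[l].size):
        Bp = [b.copy() for b in Bs]; Bp[l][j] += eps
        sq_sum += ((f(Ws, Bp) - f0) / eps) ** 2

grad_norm = np.sqrt(sq_sum)
N = sum(d[1:])                                   # total number of neurons
G = np.sqrt(N) * max(1.0, np.sqrt(d[1] * d[2]))  # proposed constant
print(f"||grad||_2 ~ {grad_norm:.4f}   vs   G = {G:.4f}")
```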
