I'm using the Softmax function as the activation function for the last layer of a neural network I am trying to code up. The function takes in a vector $\vec{z}$ of length $L$ and returns the activation values of the neurons as a vector $\vec{a}$, also of length $L$. The function is:
\begin{equation} a_j = \dfrac{e^{z_j}}{\sum_{k=1}^{L}e^{z_k}} \end{equation}
$a_1$ would be the first element in $\vec{a}$, then $a_2$, and so on until the last element, $a_L$. The values of $\vec{z}$ are exponentiated so that they are mapped into the range $(0,\infty)$. Dividing each exponentiated element by the sum of all the exponentiated elements then returns values in the range $(0,1)$ that sum to $1$.
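For reference, here is a minimal sketch of how I am computing this, assuming NumPy (the array contents are just an example):

```python
import numpy as np

def softmax(z):
    """Naive Softmax: exponentiate each element, then normalize by the sum."""
    exp_z = np.exp(z)            # each e^{z_k} lies in (0, inf)
    return exp_z / exp_z.sum()   # each a_j lies in (0, 1) and the a_j sum to 1

z = np.array([1.0, 2.0, 3.0])    # example input, length L = 3
a = softmax(z)
print(a, a.sum())                # activations and their sum (which is 1)
```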
Since $a_j$ is dependent on all the values of $\vec{z}$ -- the summation in the denominator includes all values of $\vec{z}$ -- we could think of $a_j$ as a function of each of the elements in $\vec{z}$, i.e. $a_j(z_1, z_2, z_3, ..., z_L)$. With this we can take a partial derivative with respect to one of the elements of $\vec{z}$, i.e. $\dfrac{\partial a_j}{\partial z_i}$ where $z_i$ is an arbitrary element of $\vec{z}$.
It can be shown that this derivative depends on whether $i=j$. This page describes the derivation and shows that:
\begin{equation} \dfrac{\partial a_j}{\partial z_i} = a_j \left( \delta_{ij} - a_i\right) \end{equation} where \begin{equation} \delta_{ij} = \begin{cases} 1, & \text{if } i = j\\ 0, & \text{if } i\ne j \end{cases} \end{equation}
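To make sure I am reading the indices correctly, here is a small NumPy sketch of that formula as a full Jacobian matrix, where `J[i, j]` holds $\dfrac{\partial a_j}{\partial z_i}$ (arranging it as `diag(a) - outer(a, a)` is my own reading of the formula):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """J[i, j] = d a_j / d z_i = a_j * (delta_ij - a_i)."""
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 2.0, 3.0])
print(softmax_jacobian(z))       # symmetric L x L matrix of partial derivatives
```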
My question is about subtracting the maximum of the vector before doing the Softmax function. If $\max(\vec{z}) = \hat{z}$, then the Softmax function would look like:
\begin{equation} a_j = \dfrac{e^{z_j - \hat{z}}}{\sum_{k=1}^{L}e^{z_k-\hat{z}}} \end{equation}
The reasoning behind doing this is to eliminate the possibility of an overflow error when exponentiating $z_j$. By subtracting the largest element of the vector from each element, the largest value becomes $0$ and all other elements become negative. When exponentiated, these values then fall in the range $(0,1]$.
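A minimal sketch of this shifted version, again assuming NumPy:

```python
import numpy as np

def softmax_stable(z):
    """Softmax with the maximum subtracted first, so the largest exponent is e^0 = 1."""
    shifted = z - z.max()             # largest entry becomes 0, the rest are negative
    exp_shifted = np.exp(shifted)     # values now lie in (0, 1], so no overflow
    return exp_shifted / exp_shifted.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(z) would overflow to inf here
print(softmax_stable(z))
```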
None of the resources I found show that this subtraction preserves the derivative. I tried to derive this myself but got stuck when I noticed that $\hat{z}$ is a function of the elements of $\vec{z}$, i.e. $\hat{z} = \max(z_1, z_2, z_3, ..., z_L)$. This, in turn, allows a partial derivative such as $\dfrac{\partial \hat{z}}{\partial z_i}$ to be taken. I would like to understand how the original partial derivative $\dfrac{\partial a_j}{\partial z_i}$ is derived when the maximum value of the vector is subtracted from each element.
One thing that might work is to apply the chain rule. If we let $\overset{*}{z_i} = z_i-\hat{z}$, we could write:
\begin{equation} \dfrac{\partial a_j}{\partial z_i} = \dfrac{\partial a_j}{\partial \overset{*}{z_i}} \dfrac{\partial \overset{*}{z_i}}{\partial z_i} \end{equation}
I found this last chain rule tricky to evaluate, and it might not even be a correct way of approaching this change to the Softmax derivative, in particular because every $\overset{*}{z_k}$, not just $\overset{*}{z_i}$, depends on $z_i$ through $\hat{z}$.
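The closest I have come is a numerical sanity check (not a derivation): comparing a central-difference estimate of $\dfrac{\partial a_j}{\partial z_i}$ for the shifted version against the formula $a_j\left(\delta_{ij} - a_i\right)$. The helper below and the step size `eps` are my own arbitrary choices:

```python
import numpy as np

def softmax_stable(z):
    shifted = z - z.max()
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference estimate of J[i, j] = d f(z)[j] / d z[i]."""
    J = np.zeros((len(z), len(z)))
    for i in range(len(z)):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i] += eps
        z_minus[i] -= eps
        J[i] = (f(z_plus) - f(z_minus)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
a = softmax_stable(z)
analytic = np.diag(a) - np.outer(a, a)            # a_j * (delta_ij - a_i)
numeric = numerical_jacobian(softmax_stable, z)
print(np.max(np.abs(analytic - numeric)))         # largest discrepancy between the two
```

If the two matrices agree, that would suggest the subtraction does not change the derivative, but a numerical check is not the same as understanding why.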