6

Look at the Wiki page for Softmax function (section "Smooth approximation of maximum"): https://en.wikipedia.org/wiki/Softmax_function

It says that the following is a smooth approximation to the maximum: $$ \mathcal{S}_{\alpha}\left(\left\{x_i\right\}_{i=1}^{n}\right) = \frac{\sum_{i=1}^{n}x_i e^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}} $$

  • Is it an approximation to the Softmax?

    • If so, Softmax is already smooth; why do we create another smooth approximation?

    • If so, how do we derive it from Softmax?

  • I don't see why this might be better than Softmax for gradient descent updates.
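
For concreteness, here is a minimal numerical sketch of the formula above (the vector $x$ and the values of $\alpha$ are arbitrary choices for illustration):

```python
import numpy as np

def soft_max(x, alpha):
    # Direct transcription of S_alpha(x) = sum_i x_i e^(alpha x_i) / sum_i e^(alpha x_i)
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * x)
    return np.sum(x * w) / np.sum(w)

x = [1.0, 2.0, 3.0]                 # arbitrary example vector
for alpha in (0.1, 1.0, 10.0):
    print(alpha, soft_max(x, alpha))
# 0.1  -> ~2.07 (close to the mean of x)
# 1.0  -> ~2.58
# 10.0 -> ~3.00 (close to max(x))
```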

Daniel
  • 2,760
  • 1
    I am confused by this as well, but let me make a comment that might be useful. A smooth approximation of the maximum that I am familiar with is $f(x,\alpha):=\alpha^{-1} \log\left(\sum_i e^{\alpha x_i}\right)$, which is always within an additive $(\log n)/\alpha$ of the maximum. The function in your question is $\partial(\alpha f)/\partial\alpha$. – Sasho Nikolov May 18 '15 at 04:46
  • Out of curiosity, what problem are you solving? Are you sure you can't just use the max function? Many convex optimization algorithms can handle nondifferentiable objective functions. – littleO Jul 18 '15 at 05:13
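
A quick numerical check of the LogSumExp approximation $f(x,\alpha)=\alpha^{-1}\log\sum_i e^{\alpha x_i}$ mentioned in the first comment; this is a minimal sketch with an arbitrary example vector, verifying that for $\alpha>0$ it stays within $(\log n)/\alpha$ of the true maximum:

```python
import numpy as np

def log_sum_exp_max(x, alpha):
    # f(x, alpha) = (1/alpha) * log(sum_i exp(alpha * x_i))
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(alpha * x))) / alpha

x = np.array([0.5, -1.0, 2.0, 1.9])     # arbitrary example vector
for alpha in (1.0, 5.0, 50.0):
    approx = log_sum_exp_max(x, alpha)
    bound = np.log(len(x)) / alpha
    # f always overestimates the max, by at most (log n) / alpha
    print(alpha, approx, approx - x.max() <= bound)
```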

1 Answer

4

This is a smooth approximation of the maximum function:

$$ \max\{x_1,\dots, x_n\} $$

where $\alpha$ controls the "softness" of the maximum. A detailed explanation is available here: http://www.johndcook.com/blog/2010/01/13/soft-maximum/

Softmax is better than the hard maximum because it is a smooth function, while $\max$ is not smooth and does not have a well-defined gradient everywhere.
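
To illustrate the smoothness claim, here is a minimal sketch (the example vector and $\alpha$ are arbitrary): at a point where two coordinates tie, $\max\{x_i\}$ has no well-defined gradient, but the soft maximum does. The gradient formula below follows from differentiating the quotient, with $p_j = e^{\alpha x_j}/\sum_i e^{\alpha x_i}$.

```python
import numpy as np

def soft_max(x, alpha):
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * x)
    return np.sum(x * w) / np.sum(w)

def soft_max_grad(x, alpha):
    # dS/dx_j = p_j * (1 + alpha * (x_j - S)), where p_j are the softmax weights
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * x)
    p = w / np.sum(w)
    s = np.sum(x * p)
    return p * (1.0 + alpha * (x - s))

x = np.array([1.0, 2.0, 2.0])       # tie at the maximum: hard max is not differentiable here
print(soft_max(x, 5.0))             # ~1.997, close to max(x) = 2
print(soft_max_grad(x, 5.0))        # ~[-0.013, 0.507, 0.507], a well-defined gradient
```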

coffee
  • 141
  • Could you perhaps sum up the detailed discussion in your answer? – Ali Caglayan Sep 24 '14 at 22:21
  • The summary is the first sentence of my reply; the linked explanation also gives visual examples. – coffee Sep 25 '14 at 23:26
  • 1
    In short, $\alpha \to +\infty$ makes softmax converge to $\max$, and $\alpha \to -\infty$ makes it converge to $\min$. – coffee Sep 26 '14 at 02:53
  • The blog entry discusses a different function from the one in the question. – Sasho Nikolov May 18 '15 at 04:42
  • Yes, the function described there is a little different (it is numerically better, I think; it is the one you mentioned), but the idea is the same. Both are valid approximations. – coffee May 21 '15 at 08:10
  • @coffee, I think the formula in the original post is different from the one linked in johndcook's blog. If I want two parameters to control the smoothing, one for the starting point and one for the center of the smoothing curve, is there any soft maximum that achieves this? – user1914692 Oct 01 '15 at 15:57
  • There is a problem I can't fix with this smooth max: whenever $\alpha$ is large enough to bring it close to the hard max while keeping the function differentiable, the result overflows to inf in any implementation, because of the huge numbers inside the exponentials. Do you have any solution that at least makes hard max ≈ smooth max? – Feras Oct 21 '17 at 21:42
  • @coffee I wouldn't necessarily agree that softmax is "better than" max, just in some cases :) – AB_IM Sep 03 '18 at 15:50
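
On the overflow issue raised in the comment above: a standard remedy (a sketch, not from the original thread) is to subtract $\max_i x_i$ inside the exponentials. The value of $\mathcal{S}_\alpha$ is unchanged, because the common factor $e^{-\alpha \max_i x_i}$ cancels between numerator and denominator, but the largest exponent becomes $0$, so nothing overflows even for large $\alpha$.

```python
import numpy as np

def soft_max_stable(x, alpha):
    # Shifting by max(x) cancels in the ratio, so the value is identical,
    # but the largest exponentiated term is exp(0) = 1, which avoids overflow.
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * (x - x.max()))
    return np.sum(x * w) / np.sum(w)

x = np.array([10.0, 20.0, 30.0])         # arbitrary example vector
print(soft_max_stable(x, alpha=100.0))   # ~30.0; the naive formula would overflow to inf/nan here
```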