I'll write down here the thoughts I ended up with; maybe someone will find them useful. Thanks to everyone who answered!
So, as I see it (which is kind of mind-boggling): the gradient is not a value of the function, nor an argument of the function; it is a relation between the two. But the gradient gets treated like a point, in the sense that it is a vector living in the same space as the function's arguments. This is what confused me for a while, because I was trying to connect it intuitively with both the change in the function and the change in the argument.
And for a while I thought of the derivative as the speed of change of a function (Feynman's lectures), which has its full intuitive meaning in physical equations, because there the change in time is only positive: we cannot go back in time (i.e. if a function depends on time, we cannot decrease the time parameter to get an increase in the function), so thinking in this fashion does not raise the questions that appear in the more general setting.
My current intuition: given a single parameter of a multivariable function, we take the partial derivative of the function with respect to that parameter and get the rate of change of the function with respect to a small change in the parameter. If $f(x + \Delta x) - f(x)$ is positive, the function increases given a positive change in its parameter; if it is negative, the function decreases given a positive change in its parameter. The same goes for the sign of the ratio $(f(x + \Delta x) - f(x)) / \Delta x$.
So if the ratio is positive, we can say that an increase in $x$ gives an increase in the function value; if the ratio is negative, an increase in $x$ gives a decrease in the function value. Thus $\operatorname{sign}((f(x + \Delta x) - f(x)) / \Delta x)$ can be used as the direction in which our parameter $x$ needs to be adjusted to get an increase in the function. We basically have only two directions in which we can adjust the value of $x$, so it either gives an increase or a decrease in the function (or, in some special cases, neither, e.g. the constant $f(x) = 1 + x - x$), and this holds for every parameter of a multivariable function.
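A minimal sketch of this single-parameter picture (the function and step sizes below are my own toy choices): the sign of the difference quotient tells us which way to nudge $x$ to make $f$ grow.

```python
# Toy example: use sign((f(x + dx) - f(x)) / dx) as the adjustment direction.

def f(x):
    return -(x - 3.0) ** 2  # a function with its maximum at x = 3

def ascent_direction(f, x, dx=1e-6):
    """+1.0 means 'increase x', -1.0 means 'decrease x', 0.0 means neither helps."""
    ratio = (f(x + dx) - f(x)) / dx
    return 1.0 if ratio > 0 else (-1.0 if ratio < 0 else 0.0)

x = 0.0
for _ in range(100):
    x += 0.1 * ascent_direction(f, x)  # small step in the increase direction

print(x)  # has climbed to (and then hovers around) the maximum at 3
```

With only the sign, the iterate cannot settle exactly: near the top it oscillates within one step size of the maximum, which is one hint that the magnitude of the derivative matters too.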
So for $f(x_1,x_2,\dots,x_n)$ we get a bunch of directions, $\frac{\partial f(x_1,x_2,\dots,x_n)}{\partial x_1}, \frac{\partial f(x_1,x_2,\dots,x_n)}{\partial x_2}, \dots, \frac{\partial f(x_1,x_2,\dots,x_n)}{\partial x_n}$, and each of them gives the direction of change in one parameter that increases the function. Combined into a vector, they tell us a direction in parameter space in which we should move to get an increase in the function. And it is the steepest direction, because each single-parameter derivative points toward an increase (the second and only other direction for a single parameter gives a decrease in the function value), so by how the partial derivative is defined we get the direction of steepest ascent.
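The same idea, sketched numerically for two parameters (again a toy function of my own choosing): estimate each partial derivative with a difference quotient, stack them into a vector, and check that a small step along that vector does increase the function.

```python
# Toy example: build the gradient from per-parameter difference quotients.

def f(x1, x2):
    return -(x1 - 1.0) ** 2 - (x2 + 2.0) ** 2  # maximum at (1, -2)

def numerical_gradient(f, x, dx=1e-6):
    """One difference quotient per parameter, collected into a vector (list)."""
    grad = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += dx  # nudge only the i-th parameter
        grad.append((f(*bumped) - f(*x)) / dx)
    return grad

x = [0.0, 0.0]
g = numerical_gradient(f, x)  # approximately [2.0, -4.0] at this point
step = [xi + 0.1 * gi for xi, gi in zip(x, g)]

print(f(*step) > f(*x))  # moving along the gradient increased f
```

Note that unlike the sign-only version above, this vector keeps the magnitudes of the partial derivatives, so it moves faster along the parameters that matter more.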
Though I do not yet see how the dot product comes in handy.
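(For reference, the standard route to "steepest" is exactly through the dot product: the rate of change of $f$ along a unit direction $u$ is the directional derivative

$$D_u f = \nabla f \cdot u = \lVert \nabla f \rVert \, \lVert u \rVert \cos\theta = \lVert \nabla f \rVert \cos\theta,$$

where $\theta$ is the angle between $u$ and $\nabla f$. This is maximized when $\cos\theta = 1$, i.e. when $u$ points along the gradient. The per-coordinate sign argument above gives *an* increase direction, but it is this dot-product identity that shows the gradient direction is the *steepest* one.)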
It all sounds and looks quite simple now, though it was hard for me to grasp; I hope someone might find it helpful.