Why is the function derivative used to calculate the Local Minimum instead of the actual function?

In a machine learning regression problem, why is the local minimum found using the derivative of the function instead of the function itself?

Example: http://en.wikipedia.org/wiki/Gradient_descent

The gradient descent algorithm is applied to find a local minimum of the function

$$f(x) = x^4 - 3x^3 + 2 \quad \text{(A)}$$

with derivative

$$f'(x) = 4x^3 - 9x^2. \quad \text{(B)}$$

Here, to find the local minimum of function (A) using the gradient descent algorithm, they use the derivative of (A), which is function (B).
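For concreteness, here is a minimal Python sketch of what the Wikipedia example describes: the update step uses the derivative (B), while the function (A) itself is only evaluated for reporting. The starting point and step size below are illustrative choices, not taken from the question.

```python
# Minimal gradient descent sketch for f(x) = x^4 - 3x^3 + 2, function (A).
# Starting point and step size (gamma) are illustrative choices.

def f(x):
    return x**4 - 3*x**3 + 2      # function (A), only used for reporting

def f_prime(x):
    return 4*x**3 - 9*x**2        # its derivative, function (B): drives the update

x = 6.0        # initial guess
gamma = 0.01   # step size (learning rate)

for _ in range(100_000):
    step = gamma * f_prime(x)     # the derivative gives direction and size of the move
    x -= step
    if abs(step) < 1e-10:         # stop once the derivative is essentially zero
        break

print(x, f(x))  # x ends up near 2.25, where f'(x) = 0, i.e. the local minimum of (A)
```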



2 answers


The reason is that if the function is convex (or concave if you are maximizing; the two problems are equivalent), you know there is a single minimum (maximum), which means there is only one point where the gradient is zero. There are methods that use only the values of the function itself, but if you can compute the gradient you can converge much faster, because you can think of the gradient as telling you how far from the optimum you are and in which direction to move.
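As a rough illustration of that last point, using the question's function (A): the magnitude of the derivative shrinks to zero as x approaches the local minimum at x = 2.25, so it acts as a "distance from optimal" signal that a plain function-value search does not have.

```python
# Rough illustration using the question's f(x) = x^4 - 3x^3 + 2:
# the size of the derivative shrinks to zero as x approaches the
# local minimum at x = 2.25.

def f_prime(x):
    return 4*x**3 - 9*x**2

for x in (5.0, 3.0, 2.5, 2.3, 2.25):
    print(f"x = {x:5.2f}   f'(x) = {f_prime(x):8.3f}")

# A method that only evaluates f(x) sees none of this; a gradient method
# can scale its steps by it and stop when the derivative is (near) zero.
```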



Besides gradient descent, there is an optimization method known as Newton's method, which requires computing the second derivative (the Hessian, in multivariate optimization). It converges even faster, but requires you to invert the Hessian, which is not feasible when you have many parameters. There are ways around this that compute limited-memory approximations of the Hessian (such as L-BFGS). These methods converge faster because they use information about the curvature of the function: it is a simple tradeoff, where the more you know about the function you are trying to optimize, the faster you can find a solution.
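A minimal one-dimensional sketch of Newton's method on the same function (A); in one dimension the "Hessian" is just the second derivative f''(x) = 12x^2 - 18x, and the starting point is again an illustrative choice.

```python
# One-dimensional sketch of Newton's method on f(x) = x^4 - 3x^3 + 2.
# In 1-D the Hessian reduces to the second derivative; starting point is illustrative.

def f_prime(x):
    return 4*x**3 - 9*x**2        # gradient

def f_double_prime(x):
    return 12*x**2 - 18*x         # curvature (the 1-D "Hessian")

x = 6.0
for _ in range(20):
    x -= f_prime(x) / f_double_prime(x)   # Newton step: gradient divided by curvature

print(x)  # reaches x = 2.25 in a handful of iterations, far fewer than plain gradient descent
```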



I am not a mathematician, so I cannot give you an exact answer, but you need to understand what a derivative is; see, for example:

http://en.wikipedia.org/wiki/Derivative
http://en.wikipedia.org/wiki/Differential_of_a_function



This animation shows what you want (and what differentiation gives you): http://en.wikipedia.org/wiki/File:Graph_of_sliding_derivative_line.gif

The derivative at a point is equal to the slope of the tangent line to the graph of the function at that point, and that is exactly what you need for the descent: it tells you which direction is downhill and how steep it is. Take this as a very informal point of view; the Wikipedia articles will give you much deeper and more accurate knowledge ...
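A tiny, informal check of the "slope of the tangent line" idea, using the question's function (A): the analytic derivative and a finite-difference slope agree, and their sign tells you which way is downhill. The point x = 1 and the small step h are arbitrary illustrative choices.

```python
# Informal check that the derivative equals the slope of the tangent line,
# using the question's f(x) = x^4 - 3x^3 + 2.

def f(x):
    return x**4 - 3*x**3 + 2

def f_prime(x):
    return 4*x**3 - 9*x**2

x, h = 1.0, 1e-6
secant_slope = (f(x + h) - f(x - h)) / (2 * h)   # slope of a tiny secant through nearby points

print(f_prime(x), secant_slope)  # both are about -5.0: f is falling at x = 1,
                                 # so a descent step moves x to the right (downhill)
```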
