We have seen that the momentum hyper-parameter enables us to temporally smooth out the gradient samples obtained through stochastic gradient descent. What is/are the main advantage(s) of this technique? What effect does increasing the momentum hyper-parameter β (in other words, increasing the weight of previously sampled gradients) have on the model’s capacity?


1 Response to “Momentum”

  1. Geoffroy MOURET January 31, 2013 at 20:41

    By reducing the noise and oscillations inherent in stochastic gradient descent, momentum speeds up convergence.
    However, one should be careful not to set the momentum too high. With a large β, the smoothed gradient can overshoot the minimum, and the accumulated momentum then takes some time to dissipate before the iterate heads back toward the optimum, increasing computation time.
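    The smoothing described above can be sketched as a small NumPy snippet (a minimal illustration, not taken from the post; the function name, learning rate, and test objective are my own choices). The velocity `v` is an exponential moving average of past gradients, weighted by β:

    ```python
    import numpy as np

    def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
        """Gradient descent with (heavy-ball) momentum.

        v accumulates an exponentially decaying sum of past gradients,
        which smooths out noise across stochastic gradient samples.
        """
        w = np.asarray(w0, dtype=float)
        v = np.zeros_like(w)
        for _ in range(steps):
            g = grad_fn(w)
            v = beta * v + g   # blend the new gradient into the running average
            w = w - lr * v     # step along the smoothed direction
        return w

    # Example: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
    w_star = sgd_momentum(lambda w: w, w0=[5.0, -3.0], lr=0.1, beta=0.9)
    ```

    On this simple quadratic the iterates spiral toward the origin: the eigenvalues of the update map are complex, which is exactly the overshoot-then-return behaviour described above when β is large relative to the learning rate.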
