Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/13 12:45:18 UTC
[GitHub] [incubator-mxnet] wkcn opened a new issue #15533: The problems about SGD with momentum when learning rate changes
URL: https://github.com/apache/incubator-mxnet/issues/15533
Hi, there.
Currently, SGD (Stochastic Gradient Descent) with momentum in MXNet is implemented as:
```
rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state
```
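The three pseudocode lines above can be sketched as a small NumPy step function (a minimal illustration of the quoted rule, not MXNet's actual kernel; the function name is my own):

```python
import numpy as np

def sgd_momentum_step(weight, grad, state, lr, momentum,
                      wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One update of the current SGD-with-momentum rule (sketch)."""
    g = rescale_grad * grad
    if clip_gradient is not None:
        g = np.clip(g, -clip_gradient, clip_gradient)
    rescaled_grad = lr * (g + wd * weight)   # note: lr is folded into the state
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state

w, s = np.array([1.0]), np.array([0.0])
for _ in range(2):
    w, s = sgd_momentum_step(w, np.array([1.0]), s, lr=0.1, momentum=0.9)
# after two steps: state = 0.9*0.1 + 0.1 = 0.19, weight = 0.9 - 0.19 = 0.71
```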
I found two problems in this implementation.
1. Loss of floating-point accuracy
For SGD with momentum, the variable `state` stores gradients that have already been multiplied by the learning rate. However, the learning rate is usually a small value, such as 1e-3, so `state` becomes much smaller than the gradient itself, which may lose floating-point accuracy.
2. Incorrect behavior when the learning rate changes
When the learning rate changes, the variable `state` still holds gradients scaled by the old learning rate, so the subsequent updates are wrong.
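A small numeric example (my own, using the update rule quoted above, with weight decay and clipping omitted) shows the stale scaling: after warming up with lr=0.1 and then dropping to lr=0.001, the momentum buffer still moves the weight at roughly the old scale:

```python
import numpy as np

momentum, grad = 0.9, np.array([1.0])
weight, state = np.array([1.0]), np.array([0.0])

def step(weight, state, lr):
    rescaled_grad = lr * grad                # wd and clipping omitted
    state = momentum * state + rescaled_grad
    return weight - state, state

for _ in range(5):                           # warm up at the large learning rate
    weight, state = step(weight, state, lr=0.1)

weight, state = step(weight, state, lr=0.001)  # lr dropped 100x
# the step just taken (== state) is still dominated by the old-lr momentum,
# far larger than the ~0.001 step the new learning rate alone would produce
```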
Solution:
We should update the implementation of SGD with momentum, while keeping compatibility with optimizer states saved by the old implementation:
```
rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - lr * state
```
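As a sanity check (my own sketch, again omitting weight decay and clipping), the proposed rule is equivalent to the current one as long as the learning rate stays constant; the state simply differs by a factor of lr, so the two only diverge when the schedule changes:

```python
import numpy as np

def step_old(w, s, g, lr, momentum):
    s = momentum * s + lr * g        # current rule: lr folded into the state
    return w - s, s

def step_new(w, s, g, lr, momentum):
    s = momentum * s + g             # proposed rule: raw gradients in the state
    return w - lr * s, s

rng = np.random.default_rng(0)
w1, s1 = np.array([1.0]), np.array([0.0])
w2, s2 = np.array([1.0]), np.array([0.0])
for _ in range(10):
    g = rng.standard_normal(1)
    w1, s1 = step_old(w1, s1, g, lr=0.01, momentum=0.9)
    w2, s2 = step_new(w2, s2, g, lr=0.01, momentum=0.9)
# with a constant lr the two weight trajectories match (s_old == lr * s_new)
```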