Posted to commits@mxnet.apache.org by ha...@apache.org on 2019/01/04 21:20:11 UTC
[incubator-mxnet] branch master updated: Update Adam optimizer documentation (#13754)
This is an automated email from the ASF dual-hosted git repository.
haibin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git
The following commit(s) were added to refs/heads/master by this push:
new 6a4bac6 Update Adam optimizer documentation (#13754)
6a4bac6 is described below
commit 6a4bac6a96fabb76625a68edb21a194040efa836
Author: Haibin Lin <li...@gmail.com>
AuthorDate: Fri Jan 4 13:19:53 2019 -0800
Update Adam optimizer documentation (#13754)
---
python/mxnet/optimizer/optimizer.py | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/python/mxnet/optimizer/optimizer.py b/python/mxnet/optimizer/optimizer.py
index ba16132..d290a3f 100644
--- a/python/mxnet/optimizer/optimizer.py
+++ b/python/mxnet/optimizer/optimizer.py
@@ -1030,13 +1030,14 @@ class Adam(Optimizer):
Stochastic Optimization*, available at http://arxiv.org/abs/1412.6980.
If the storage type of grad is ``row_sparse`` and ``lazy_update`` is True, \
- **lazy updates** are applied by::
+ **lazy updates** at step t are applied by::
for row in grad.indices:
    rescaled_grad[row] = clip(grad[row] * rescale_grad + wd * weight[row], clip_gradient)
    m[row] = beta1 * m[row] + (1 - beta1) * rescaled_grad[row]
    v[row] = beta2 * v[row] + (1 - beta2) * (rescaled_grad[row]**2)
-     w[row] = w[row] - learning_rate * m[row] / (sqrt(v[row]) + epsilon)
+     lr = learning_rate * sqrt(1 - beta2**t) / (1 - beta1**t)
+     w[row] = w[row] - lr * m[row] / (sqrt(v[row]) + epsilon)
The lazy update only updates the mean and var for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating them for all indices.
@@ -1044,12 +1045,13 @@ class Adam(Optimizer):
throughput for some applications. However, it provides slightly different semantics than
the original update, and may lead to different empirical results.
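For concreteness, here is a minimal standalone NumPy sketch of the lazy semantics
(the function name and the (data, indices) representation of the row_sparse
gradient are illustrative only, not MXNet API; rescale_grad, wd and
clip_gradient are omitted for brevity)::

    import numpy as np

    def lazy_adam_step(w, m, v, grad_data, grad_indices, t,
                       learning_rate=0.001, beta1=0.9, beta2=0.999,
                       epsilon=1e-8):
        # Bias-corrected step size at step t (t is 1-based).
        lr = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
        # Only rows present in the sparse gradient are touched; all other
        # rows of w, m and v keep their stale values ("lazy" update).
        for g, row in zip(grad_data, grad_indices):
            m[row] = beta1 * m[row] + (1 - beta1) * g
            v[row] = beta2 * v[row] + (1 - beta2) * g**2
            w[row] -= lr * m[row] / (np.sqrt(v[row]) + epsilon)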
- Otherwise, **standard updates** are applied by::
+ Otherwise, **standard updates** at step t are applied by::
rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
m = beta1 * m + (1 - beta1) * rescaled_grad
v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
- w = w - learning_rate * m / (sqrt(v) + epsilon)
+ lr = learning_rate * sqrt(1 - beta2**t) / (1 - beta1**t)
+ w = w - lr * m / (sqrt(v) + epsilon)
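As a sanity check on the formulas above, here is a small self-contained NumPy
snippet (assumed names, not MXNet code) verifying that this rescaled-lr form
matches the classic bias-corrected update from Kingma & Ba when epsilon is
scaled accordingly::

    import numpy as np

    rng = np.random.default_rng(0)
    learning_rate, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8
    w = rng.standard_normal(5)
    m, v = np.zeros(5), np.zeros(5)
    for t in range(1, 4):                  # t is 1-based
        grad = rng.standard_normal(5)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        # Form used in the docstring (epsilon scaled so the two agree).
        lr = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
        step_a = lr * m / (np.sqrt(v) + epsilon * np.sqrt(1 - beta2**t))
        # Classic form with explicit bias-corrected moments.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        step_b = learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        assert np.allclose(step_a, step_b)
        w -= step_b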
This optimizer accepts the following parameters in addition to those accepted
by :class:`.Optimizer`.
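A brief usage sketch, assuming the MXNet 1.x Gluon API (the hyperparameter
values shown are just the defaults)::

    import mxnet as mx
    from mxnet import gluon

    net = gluon.nn.Dense(10)
    net.initialize()
    # Select Adam by name and pass its hyperparameters through Trainer.
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': 0.001, 'beta1': 0.9,
                             'beta2': 0.999, 'epsilon': 1e-8,
                             'lazy_update': True})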