Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2019/11/12 08:37:09 UTC
[GitHub] [singa] chrishkchris opened a new pull request #560: SINGA-487 Accumulate gradients to reduce network latency
URL: https://github.com/apache/singa/pull/560
This PR reduces network latency by accumulating gradients in a memory buffer before sending them out with NCCL.
Because many gradients are sent together in one buffer, fewer NCCL API calls are needed, which removes much of the per-call TCP/IP latency.
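To make the idea concrete, here is a minimal Python sketch of gradient fusion (bucketing). It assumes a simulated all-reduce in place of NCCL and is not SINGA's implementation; `simulated_allreduce`, `fused_allreduce` and `bucket_size` are hypothetical names chosen for illustration. Gradient tensors are packed into a flat buffer until a size threshold is reached, and each buffer is then reduced with a single collective call, so the fixed per-call latency is paid once per bucket rather than once per tensor.

```python
# Minimal, NCCL-free sketch of gradient fusion: pack many small gradient
# tensors into one flat buffer so one collective call replaces many.
# All names here (simulated_allreduce, fused_allreduce, bucket_size) are
# hypothetical and are NOT part of SINGA's API.
from typing import List

Tensor = List[float]  # a flattened gradient tensor


def simulated_allreduce(buffers: List[Tensor], calls: List[int]) -> Tensor:
    """Stand-in for one sum all-reduce across workers.

    buffers[w] is worker w's copy of the same-sized buffer; the element-wise
    sum is returned. calls[0] counts how many times this was invoked.
    """
    calls[0] += 1
    return [sum(vals) for vals in zip(*buffers)]


def fused_allreduce(worker_grads: List[List[Tensor]], bucket_size: int,
                    calls: List[int]) -> List[Tensor]:
    """Average gradients across workers using bucketed (fused) all-reduces.

    worker_grads[w][i] is gradient tensor i on worker w. Tensors are added
    to the current bucket until the next one would exceed bucket_size
    elements; the bucket is then reduced in a single call.
    """
    num_workers = len(worker_grads)
    num_tensors = len(worker_grads[0])
    result: List[Tensor] = [[] for _ in range(num_tensors)]

    def flush(indices: List[int]) -> None:
        if not indices:
            return
        # Concatenate this bucket's tensors into one flat buffer per worker.
        packed = [[x for i in indices for x in worker_grads[w][i]]
                  for w in range(num_workers)]
        summed = simulated_allreduce(packed, calls)
        # Unpack the reduced buffer back into per-tensor averages.
        offset = 0
        for i in indices:
            n = len(worker_grads[0][i])
            result[i] = [v / num_workers for v in summed[offset:offset + n]]
            offset += n

    bucket: List[int] = []
    bucket_elems = 0
    for i in range(num_tensors):
        n = len(worker_grads[0][i])
        if bucket and bucket_elems + n > bucket_size:
            flush(bucket)
            bucket, bucket_elems = [], 0
        bucket.append(i)
        bucket_elems += n
    flush(bucket)  # send whatever is left in the last bucket
    return result


if __name__ == "__main__":
    calls = [0]
    grads = [
        [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]],  # worker 0
        [[3.0, 4.0], [5.0], [6.0, 7.0, 8.0]],  # worker 1
    ]
    avg = fused_allreduce(grads, bucket_size=8, calls=calls)
    print(avg)       # [[2.0, 3.0], [4.0], [5.0, 6.0, 7.0]]
    print(calls[0])  # 1 (one call instead of three)
```

In this sketch a larger `bucket_size` means fewer calls, at the cost of waiting longer before the first buffer can be sent; a tensor larger than `bucket_size` simply gets a bucket of its own.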
Combined with the changes from PR #555, here is a simple test to check that training is still correct:
```
ubuntu@ip-172-31-26-214:~/singa/examples/autograd$ python3 mnist_multiprocess.py
Starting Epoch 0:
Training loss = 831.072205, training accuracy = 0.700454
Evaluation accuracy = 0.927015, Elapsed Time = 0.676089s
Starting Epoch 1:
Training loss = 248.684601, training accuracy = 0.916183
Evaluation accuracy = 0.958265, Elapsed Time = 0.545179s
Starting Epoch 2:
Training loss = 172.330597, training accuracy = 0.943042
Evaluation accuracy = 0.967928, Elapsed Time = 0.543617s
Starting Epoch 3:
Training loss = 139.254807, training accuracy = 0.953425
Evaluation accuracy = 0.973067, Elapsed Time = 0.530805s
Starting Epoch 4:
Training loss = 115.329491, training accuracy = 0.960737
Evaluation accuracy = 0.976049, Elapsed Time = 0.530590s
Starting Epoch 5:
Training loss = 101.911728, training accuracy = 0.966179
Evaluation accuracy = 0.974095, Elapsed Time = 0.529574s
Starting Epoch 6:
Training loss = 90.820244, training accuracy = 0.969969
Evaluation accuracy = 0.980983, Elapsed Time = 0.530502s
Starting Epoch 7:
Training loss = 86.718071, training accuracy = 0.971037
Evaluation accuracy = 0.977590, Elapsed Time = 0.531085s
Starting Epoch 8:
Training loss = 79.507553, training accuracy = 0.973675
Evaluation accuracy = 0.976562, Elapsed Time = 0.529935s
Starting Epoch 9:
Training loss = 78.784409, training accuracy = 0.974025
Evaluation accuracy = 0.980469, Elapsed Time = 0.530919s
```