Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2019/11/12 08:37:09 UTC

[GitHub] [singa] chrishkchris opened a new pull request #560: SINGA-487 Accumulate gradients to reduce network latency

URL: https://github.com/apache/singa/pull/560
 
 
   This PR reduces network latency by accumulating gradients in a memory buffer before sending them out with NCCL.
   Fewer NCCL API calls are needed as a result, which removes much of the TCP/IP latency.
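
The idea can be illustrated with a minimal pure-Python sketch. It is not the code in this PR: `CountingAllReduce` is a hypothetical stand-in for an NCCL all-reduce that only counts how often it is called, and gradients are modelled as flat lists of floats. The sketch only shows that packing all gradients into one buffer gives the same result with a single collective call instead of one call per tensor.

```python
class CountingAllReduce:
    """Hypothetical stand-in for an NCCL all-reduce (assumption, not a real API)."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.calls = 0  # number of simulated collective calls

    def __call__(self, values):
        self.calls += 1
        # Pretend every rank holds identical values, so the sum over ranks
        # is simply world_size times the local value.
        return [v * self.world_size for v in values]


def all_reduce_per_tensor(grads, all_reduce):
    """Baseline: one collective call per gradient tensor."""
    return [all_reduce(g) for g in grads]


def all_reduce_fused(grads, all_reduce):
    """Accumulate all gradients in one buffer, then issue a single call."""
    # Pack: copy every gradient into one contiguous buffer.
    buffer = [v for g in grads for v in g]
    reduced = all_reduce(buffer)
    # Unpack: slice the reduced buffer back into per-tensor gradients.
    out, offset = [], 0
    for g in grads:
        out.append(reduced[offset:offset + len(g)])
        offset += len(g)
    return out


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
    per_tensor, fused = CountingAllReduce(2), CountingAllReduce(2)
    assert all_reduce_per_tensor(grads, per_tensor) == all_reduce_fused(grads, fused)
    print("calls per tensor:", per_tensor.calls, "| calls fused:", fused.calls)
```

The trade-off in such a scheme is an extra copy into and out of the buffer in exchange for fewer round trips, which tends to pay off when a model has many small gradient tensors.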
   
   Together with the result of PR #555, here is a simple test to make sure the training is correct:
   ```
   ubuntu@ip-172-31-26-214:~/singa/examples/autograd$ python3 mnist_multiprocess.py
   Starting Epoch 0:
   Training loss = 831.072205, training accuracy = 0.700454
   Evaluation accuracy = 0.927015, Elapsed Time = 0.676089s
   Starting Epoch 1:
   Training loss = 248.684601, training accuracy = 0.916183
   Evaluation accuracy = 0.958265, Elapsed Time = 0.545179s
   Starting Epoch 2:
   Training loss = 172.330597, training accuracy = 0.943042
   Evaluation accuracy = 0.967928, Elapsed Time = 0.543617s
   Starting Epoch 3:
   Training loss = 139.254807, training accuracy = 0.953425
   Evaluation accuracy = 0.973067, Elapsed Time = 0.530805s
   Starting Epoch 4:
   Training loss = 115.329491, training accuracy = 0.960737
   Evaluation accuracy = 0.976049, Elapsed Time = 0.530590s
   Starting Epoch 5:
   Training loss = 101.911728, training accuracy = 0.966179
   Evaluation accuracy = 0.974095, Elapsed Time = 0.529574s
   Starting Epoch 6:
   Training loss = 90.820244, training accuracy = 0.969969
   Evaluation accuracy = 0.980983, Elapsed Time = 0.530502s
   Starting Epoch 7:
   Training loss = 86.718071, training accuracy = 0.971037
   Evaluation accuracy = 0.977590, Elapsed Time = 0.531085s
   Starting Epoch 8:
   Training loss = 79.507553, training accuracy = 0.973675
   Evaluation accuracy = 0.976562, Elapsed Time = 0.529935s
   Starting Epoch 9:
   Training loss = 78.784409, training accuracy = 0.974025
   Evaluation accuracy = 0.980469, Elapsed Time = 0.530919s
   ```
