Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2019/11/05 13:49:37 UTC

[GitHub] [singa] chrishkchris opened a new pull request #558: SINGA-487 Parallelize Computation and Communication using CUDA stream concurrency

URL: https://github.com/apache/singa/pull/558
 
 
   This PR parallelizes computation and communication using CUDA stream concurrency, which can reduce the communication overhead.
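
To illustrate the general idea of overlapping communication with computation, here is a minimal CPU-only sketch in Python. It is not the code in this PR: a single-worker thread pool stands in for a dedicated communication stream, and `backward_layer`, `all_reduce`, and `train_step` are hypothetical names made up for this illustration. The point is only that gradient synchronization for one layer can be queued while the next layer's gradients are still being computed, with a wait before the results are used.

```python
from concurrent.futures import ThreadPoolExecutor


def backward_layer(layer_id):
    """Hypothetical stand-in for a backward kernel on the compute stream."""
    return [float(layer_id)] * 4


def all_reduce(grad, world_size=2):
    """Hypothetical stand-in for an all-reduce on the communication stream.

    Every simulated worker is assumed to hold the same gradient, so the
    reduced sum is simply grad * world_size.
    """
    return [g * world_size for g in grad]


def train_step(num_layers=3):
    """Queue each layer's all-reduce while later backward work proceeds."""
    pending = {}
    # One worker thread behaves like a single in-order queue of work,
    # which is the role a separate communication stream would play.
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        for layer_id in reversed(range(num_layers)):
            grad = backward_layer(layer_id)  # "compute stream" work
            # Enqueue communication and return immediately, so the next
            # layer's backward pass can overlap with this all-reduce.
            pending[layer_id] = comm_stream.submit(all_reduce, grad)
        # Wait for all communication before gradients are used,
        # analogous to synchronizing the two streams.
        return {lid: fut.result() for lid, fut in pending.items()}


if __name__ == "__main__":
    for layer_id, grad in sorted(train_step().items()):
        print(layer_id, grad)
```

In a real GPU setting the thread pool would be replaced by a second CUDA stream and the final wait by a stream or event synchronization; those details are assumptions here and are not taken from the PR.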
   
   Together with PR #555, here is a simple test run to check that the training is still correct:
   ```
   ubuntu@ip-172-31-29-119:~/singa/examples/autograd$ /home/ubuntu/mpich-3.3/build/bin/mpiexec --hostfile host_file python3 mnist_dist.py
   Starting Epoch 0:
   Training loss = 931.969849, training accuracy = 0.675197
   Evaluation accuracy = 0.913137, Elapsed Time = 0.733470s
   Starting Epoch 1:
   Training loss = 280.136505, training accuracy = 0.910273
   Evaluation accuracy = 0.954975, Elapsed Time = 0.642032s
   Starting Epoch 2:
   Training loss = 188.183517, training accuracy = 0.939837
   Evaluation accuracy = 0.967619, Elapsed Time = 0.650523s
   Starting Epoch 3:
   Training loss = 147.724915, training accuracy = 0.952941
   Evaluation accuracy = 0.971012, Elapsed Time = 0.639127s
   Starting Epoch 4:
   Training loss = 125.514275, training accuracy = 0.959402
   Evaluation accuracy = 0.974404, Elapsed Time = 0.637774s
   Starting Epoch 5:
   Training loss = 113.583031, training accuracy = 0.963174
   Evaluation accuracy = 0.974918, Elapsed Time = 0.638678s
   Starting Epoch 6:
   Training loss = 105.422485, training accuracy = 0.965895
   Evaluation accuracy = 0.979852, Elapsed Time = 0.637032s
   Starting Epoch 7:
   Training loss = 94.718765, training accuracy = 0.968850
   Evaluation accuracy = 0.976871, Elapsed Time = 0.638873s
   Starting Epoch 8:
   Training loss = 87.026405, training accuracy = 0.971421
   Evaluation accuracy = 0.976768, Elapsed Time = 0.637387s
   Starting Epoch 9:
   Training loss = 79.878670, training accuracy = 0.973708
   Evaluation accuracy = 0.981805, Elapsed Time = 0.639177s
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services