Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/26 13:39:13 UTC

[GitHub] Jeffery4000 opened a new issue #10247: Slow speed on distributed training

URL: https://github.com/apache/incubator-mxnet/issues/10247
 
 
   Hi, I'm trying to train across multiple nodes with 4 GPUs each, after being able to achieve almost linear speedup on a single node with 4 GPUs. The dataset used was ImageNet 2012, tested on ResNet-50 based on the example provided by MXNet.
   
   I launch the training with the command below. I try to fully utilize GPU memory by controlling the batch size, and I use the maximum number of threads for data decoding; after running the test-io benchmark, each node is able to decode approximately 1000 samples/sec:
   
   `python ../../tools/launch.py -n 2 -H hosts python train_imagenet.py --network resnet --num-layers 50 --gpus 0,1 --batch-size 180 --data-nthreads 15 --top-k 5 --data-train ./data/train_data.rec --data-val ./data/val_data.rec --num-epochs 1 --kv-store dist_device_sync`
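   
   For reference, the IO benchmark mentioned above was run with the same script; this is a rough sketch, assuming the example's `--test-io` flag (measure reading speed only, without training), which may differ between MXNet versions:
   
   `python train_imagenet.py --data-train ./data/train_data.rec --data-val ./data/val_data.rec --batch-size 180 --data-nthreads 15 --test-io 1`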
   
   Results:
   Single node with 1 GPU: 122 samples/sec
   Single node with 2 GPUs: 234 samples/sec
   Single node with 3 GPUs: 353 samples/sec
   Single node with 4 GPUs: 477 samples/sec
   Two nodes with 1 GPU each: 111 samples/sec on each machine = 222 samples/sec
   Two nodes with 2 GPUs each: 220 samples/sec on each machine = 440 samples/sec
   Two nodes with 3 GPUs each: 300 samples/sec on each machine = 600 samples/sec
   Two nodes with 4 GPUs each: 355 samples/sec on each machine = 710 samples/sec
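   
   To quantify the drop, here is a quick back-of-the-envelope calculation from the numbers above (my own arithmetic, not extra measurements):
   
   ```python
   # Scaling efficiency = aggregate two-node throughput / (2 x single-node throughput)
   single_node = {1: 122, 2: 234, 3: 353, 4: 477}  # samples/sec on one node
   two_nodes   = {1: 222, 2: 440, 3: 600, 4: 710}  # samples/sec across two nodes
   
   for gpus in sorted(single_node):
       eff = two_nodes[gpus] / (2 * single_node[gpus])
       print(f"{gpus} GPU(s) per node: {eff:.0%} scaling efficiency")
   # roughly 91%, 94%, 85%, 74% -- the 4-GPU-per-node case loses the most
   ```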
   
   From the results above, I just want to understand why there is such a large drop in speed when training on two nodes with 4 GPUs each.
   
   
