Posted to discuss-archive@mxnet.apache.org by Yazooliu via MXNet Forum <mx...@discoursemail.com.INVALID> on 2021/12/22 08:40:58 UTC

[MXNet Forum] [Performance] Multi system multi gpu distributed training slower than single system multi-gpu


Any help from your side would be appreciated. I look forward to your reply. Thanks a lot.
I used TF 2.5 to do distributed training. The hardware setup is 2 machines with 2 GPUs each (i.e. 2m4GPU). The two machines are connected over optical fiber, and the link bandwidth supports 10 Gbit/s = 1250 MB/s. The hardware (machine model, GPU type, and memory) is identical on both machines.

Let's compare the following 2 test cases:

1m2GPU:
I used the TF distribution strategy MirroredStrategy to train on 1 machine with 2 GPUs. The training task is an ALBERT-base 12-layer text classification task. Training time was 1522 seconds; GPU memory usage was 96.88% and GPU utilization was 95.78%.
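
For reference, the single-machine run is set up roughly like this (a simplified sketch; build_albert_classifier and train_dataset stand in for my actual ALBERT-base model and input code):

```python
import tensorflow as tf

# Single machine, 2 GPUs: MirroredStrategy keeps one model replica per GPU
# and all-reduces gradients locally (NVLink/PCIe), so no network traffic.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

with strategy.scope():
    model = build_albert_classifier()  # placeholder for my ALBERT-base 12-layer classifier
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(train_dataset, epochs=3)  # train_dataset is my tf.data pipeline (see below)
```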

2m4GPU:
I used the TF distribution strategy MultiWorkerMirroredStrategy to train on 2 machines with 4 GPUs. The same ALBERT-base 12-layer text classification task took 1013 seconds; GPU memory usage was 82.99% and GPU utilization was 71.89%.
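
The multi-worker run is roughly the following, started once on each machine with its own task index (host names and ports below are placeholders for my two machines):

```python
import json
import os
import tensorflow as tf

# TF_CONFIG must be set before the strategy is created.
# "host-a"/"host-b" are placeholders; index is 0 on the first machine, 1 on the second.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host-a:12345", "host-b:12345"]},
    "task": {"type": "worker", "index": 0},
})

# 2 workers x 2 GPUs = 4 replicas; gradients are all-reduced across the network.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = build_albert_classifier()  # same placeholder model as above
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(train_dataset, epochs=3)  # the global batch is split across the 4 replicas
```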

Comparing 1m2GPU with 2m4GPU, the multi-machine multi-GPU run saves only (1522 - 1013) / 1522 = 33.4% of the training time, instead of the ~50% I would expect from doubling the number of GPUs. I also monitored the network bandwidth between the 2 machines and found about 152 MB/s used on average.
CPU usage is about 170% and memory usage is 2.25%, so neither of them is the bottleneck.
What methods can I use to improve and accelerate the multi-machine multi-GPU distributed training?
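
For example, one thing I am unsure about is whether explicitly selecting NCCL for the cross-worker all-reduce (instead of the default AUTO) would help; a sketch of what I mean:

```python
import tensorflow as tf

# Assumption on my side: force NCCL collectives instead of letting TF choose (AUTO).
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)
```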

Another question: during multi-machine multi-GPU distributed training, the GPU memory usage is lower than in single-machine multi-GPU training, and the GPU utilization also becomes lower. What is the root cause of this?

I also tried some methods to accelerate, such as the tf.data input pipeline (e.g. prefetch/map_and_batch/num_parallel_batches/shuffle/repeat), but the pipeline doesn't seem to bring an obvious speedup.
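
The pipeline I tried looks roughly like this (simplified; the file pattern, feature names, and batch size are placeholders, and I use a separate map + batch instead of the fused map_and_batch, which is deprecated in TF 2.x):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
global_batch_size = 64  # example value; tf.distribute splits it across replicas

def parse_example(serialized):
    # Placeholder parser: replace with the real feature spec of my ALBERT inputs.
    features = tf.io.parse_single_example(serialized, {
        "input_ids": tf.io.FixedLenFeature([128], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    return features["input_ids"], features["label"]

train_dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))
                 .shuffle(10_000)
                 .repeat()
                 .map(parse_example, num_parallel_calls=AUTOTUNE)
                 .batch(global_batch_size, drop_remainder=True)
                 .prefetch(AUTOTUNE))
```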
Do you have any good suggestions to accelerate this?
Thanks a lot.





---
[Visit Topic](https://discuss.mxnet.apache.org/t/multi-system-multi-gpu-distributed-training-slower-than-single-system-multi-gpu/1270/6) or reply to this email to respond.
