Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/27 08:26:36 UTC

[GitHub] [incubator-mxnet] YouhuiBai opened a new issue #15674: Straggler in latest mxnet when training with distributed parameter server

URL: https://github.com/apache/incubator-mxnet/issues/15674

## Description
Hi, I found a strange straggler in the latest MXNet when training CNNs on GPUs with the distributed parameter server architecture (BSP model): it is always the worker whose `rank == 0`. I think it is a bug in MXNet, because I deployed MXNet in a homogeneous environment, meaning that every participating machine has the same hardware and software environment, listed below. The straggler still appeared when I changed the number of workers or switched to different physical machines, such as instances on AWS.

## Environment info (Required)

```
system version: CentOS 7.5.1804
kernel version: 3.10.0-862.9.1.el7.x86_64
cuda version: cuda_9.2.148
cudnn version: cudnn-9.2-linux-x64-v7.1
nvidia driver version: 396.37
GPU: GeForce GTX 1080 Ti
NIC: 10GE

```
Software and parameters:
```
parameter server architecture: m servers and n workers, n >= m, each role on a different physical machine
application: image classification
dataset: ImageNet 2012
CNN model: inception-v4, lenet, resnet, etc.
GPU usage: one physical GPU per worker
scaling model: strong scaling
consistency model: BSP
```

## What is a straggler?
When I started training with the environment and parameters above and broke down the time spent on the critical path, I found that one worker behaves strangely. Under the BSP consistency model of the parameter server, the server does not respond to push operations for a key until it has received updates for that key from all workers. We found that one worker was consistently slower, so the other workers had to wait for it in every iteration; this is the straggler. The straggler also has the following characteristics:

1. rank == 0, the first worker
2. higher CPU usage
3. higher CPU memory throughput
4. higher GPU usage
5. spends more time in `cudaMemcpy` calls
6. higher LLC miss rate
7. lower CPU memory occupancy
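
To make the BSP behavior described above concrete, here is a minimal sketch in plain Python of why a single slow worker delays everyone. It is not MXNet or ps-lite code: `BspServer`, `simulate_iteration`, the key name `w0`, and the timings are all hypothetical and chosen only for illustration. The toy server holds back the response for a key until every worker has pushed it, so the release time equals the arrival time of the slowest push.

```python
from collections import defaultdict


class BspServer:
    """Toy BSP server (hypothetical, not MXNet's implementation)."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        # key -> {rank: (arrival_time_ms, gradient)}
        self.pending = defaultdict(dict)

    def push(self, key, rank, arrival_ms, grad):
        """Record one push. Return (release_ms, summed_grad) once all
        workers have pushed this key, otherwise None."""
        self.pending[key][rank] = (arrival_ms, grad)
        if len(self.pending[key]) < self.num_workers:
            return None  # still waiting for at least one worker
        pushes = self.pending.pop(key)
        # The response is released only when the last push has arrived.
        release_ms = max(t for t, _ in pushes.values())
        total = sum(g for _, g in pushes.values())
        return release_ms, total


def simulate_iteration(arrival_times_ms, key="w0"):
    """Simulate one iteration; return (release_ms, wait time per rank)."""
    server = BspServer(num_workers=len(arrival_times_ms))
    result = None
    for rank, t in enumerate(arrival_times_ms):
        result = server.push(key, rank, arrival_ms=t, grad=1.0)
    release_ms, _ = result
    waits = {rank: release_ms - t for rank, t in enumerate(arrival_times_ms)}
    return release_ms, waits


# Made-up timings in milliseconds: rank 0 pushes later than the others.
release, waits = simulate_iteration([300, 200, 200, 200])
# The worker that never waits for anyone is the one everyone waits for.
straggler = min(waits, key=waits.get)
print(release, waits, straggler)
```

With these made-up numbers, every worker other than rank 0 idles until rank 0's push arrives, which matches the symptom reported here: the iteration time is set by the slowest worker, not by the average.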

It is very strange. Thanks a lot.
