Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/29 01:58:37 UTC
[GitHub] XiaotaoChen opened a new issue #9611: program can't finish normally in dist_sync mode
URL: https://github.com/apache/incubator-mxnet/issues/9611
### Description
I trained ResNet-50 on 4 nodes using the example/image-classification/train_imagenet.py script. At the beginning it runs normally, but in the last epoch, **when one of the workers finishes its total number of iterations (call it n), the others are still waiting for the update in iteration n+1.**
**This bug is caused by the workers having different total iteration counts.**
### Configuration
**The launch command below was run on gn2:**
```
python -u /THL5/home/daodao/cxt/mxnet0.11/tools/launch.py -n 4 --launcher ssh -H hosts-4 "python -u /THL5/home/daodao/cxt/mxnet0.11/example/image-classification/train_imagenet.py --gpus 0,1,2,3 --kv-store dist_sync "
```
**hosts-4 contents**
```
gn2
gn3
gn4
gn5
```
**configuration for training**
```
num_examples:120000
num_epochs:80
```
**Full details of the training configuration**
```
INFO:root:start with arguments Namespace(batch_size=256, benchmark=0, data_nthreads=4, data_train='/THL5/home/daodao/imagenet/mxnet_cxt/train1200000.rec', data_val='/THL5/home/daodao/imagenet/mxnet_cxt/val40000.rec', disp_batches=1, dtype='float32', gpus='0,1,2,3', image_shape='3,224,224', kv_store='dist_sync', load_epoch=None, lr=0.4, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=120000, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
```
### Analysis
**Root cause**
The dataset is partitioned across workers by mx.io.ImageRecordIter, and the partitions do not all contain the same number of examples. As a result, the number of iterations per epoch differs between workers, and so does the total number of iterations over the whole training run.
When the worker with the fewest total iterations (call it worker1) finishes its last iteration n, the other workers still need to train iterations n+1, n+2, .... To complete the update for iteration n+1, the kvstore must receive gradients from every worker, but worker1 has already finished and will never push again. The kvstore therefore keeps waiting for worker1's push, and the program cannot finish normally.
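The mismatch can be illustrated with a small sketch (plain Python, not the actual ImageRecordIter partitioning code, whose internals are an assumption here): a naive split of a dataset that is not divisible by the worker count gives some workers one extra example, and hence extra iterations per epoch.

```python
import math

def naive_shard_sizes(num_examples, num_workers):
    """Naive partition: the remainder goes to the first few workers,
    so shard sizes can differ by one example."""
    base, rem = divmod(num_examples, num_workers)
    return [base + (1 if rank < rem else 0) for rank in range(num_workers)]

def iters_per_epoch(shard_size, batch_size):
    """Each worker runs ceil(shard_size / batch_size) updates per epoch."""
    return math.ceil(shard_size / batch_size)

# Toy numbers: 10 examples, 4 workers, batch size 1 per worker
shards = naive_shard_sizes(10, 4)                 # [3, 3, 2, 2]
iters = [iters_per_epoch(s, 1) for s in shards]   # [3, 3, 2, 2]
```

Over 80 epochs these per-epoch differences accumulate into different totals for each worker, which is exactly what the table below shows.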
### Evidence
**Number of iterations per epoch on each worker**
| epoch | worker0 | worker1 | worker2 | worker3 |
| ----- | ------- | ------- | ------- | ------- |
| 1 | 1172 | 1173 | 1172 | 1172 |
| 2 | 1172 | 1172 | 1172 | 1172 |
| 3 | 1171 | 1172 | 1172 | 1172 |
| 4 | 1172 | 1172 | 1172 | 1172 |
| 5 | 1171 | 1172 | 1172 | 1172 |
| 6 | 1172 | 1172 | 1172 | 1171 |
| 7 | 1172 | 1173 | 1172 | 1172 |
| 8 | 1171 | 1172 | 1172 | 1172 |
| 9 | 1172 | 1172 | 1172 | 1172 |
| 10 | 1171 | 1172 | 1172 | 1172 |
| 11 | 1172 | 1172 | 1172 | 1171 |
| 12 | 1171 | 1172 | 1172 | 1172 |
| ... | ... | ... | ... | ... |
**Log output**
```
INFO:root:Epoch[0] Rank[3] Batch[1171] TotalIter[1171] Speed: 187.74 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[0] Batch[1171] TotalIter[1171] Speed: 187.42 samples/sec
INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.830
INFO:root:Epoch[0] Rank[2] Batch[1171] TotalIter[1171] Speed: 188.36 samples/sec INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1585.818
INFO:root:Epoch[0] Rank[1] Batch[1172] TotalIter[1172] Speed: 189.53 samples/sec INFO:root:Epoch[0] Train-accuracy=nan
INFO:root:Epoch[0] Time cost=1587.169
INFO:root:Epoch[1] Rank[3] Batch[1] TotalIter[1173] Speed: 189.76 samples/sec
INFO:root:Epoch[1] Rank[0] Batch[1] TotalIter[1173] Speed: 187.95 samples/sec
INFO:root:Epoch[1] Rank[2] Batch[1] TotalIter[1173] Speed: 189.15 samples/sec
INFO:root:Epoch[1] Rank[3] Batch[2] TotalIter[1174] Speed: 188.17 samples/sec
INFO:root:Epoch[1] Rank[1] Batch[1] TotalIter[1174] Speed: 189.03 samples/sec
INFO:root:Epoch[1] Rank[0] Batch[2] TotalIter[1174] Speed: 189.08 samples/sec
INFO:root:Epoch[1] Rank[2] Batch[2] TotalIter[1174] Speed: 188.24 samples/sec
......
INFO:root:Epoch[79] Rank[3] Batch[1151] TotalIter[93724] Speed: 189.38 samples/sec
INFO:root:Epoch[79] Rank[1] Batch[1123] TotalIter[93724] Speed: 189.85 samples/sec
INFO:root:Epoch[79] Rank[2] Batch[1139] TotalIter[93724] Speed: 189.56 samples/sec
INFO:root:Epoch[79] Rank[0] Batch[1169] TotalIter[93724] Speed: 189.05 samples/sec
INFO:root:Epoch[79] Rank[1] Batch[1124] TotalIter[93725] Speed: 189.96 samples/sec
INFO:root:Epoch[79] Rank[3] Batch[1152] TotalIter[93725] Speed: 188.91 samples/sec
INFO:root:Epoch[79] Rank[2] Batch[1140] TotalIter[93725] Speed: 190.17 samples/sec
INFO:root:Epoch[79] Rank[0] Batch[1170] TotalIter[93725] Speed: 190.06 samples/sec
INFO:root:Epoch[79] Train-accuracy=nan
INFO:root:Epoch[79] Time cost=1582.509
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
Killed by signal 15.^M
```
From the log and the table above we can see:
1. At the end of epoch 79, worker0 finished its 1171st update and then printed the epoch summary.
2. The table shows the other workers had not finished the epoch: they were still waiting for the update in their 1172nd iteration.
3. TotalIter shows that every worker had executed 93725 iterations over the whole training run, and that each iteration is strictly synchronous.
**ps info and GPU usage**
Before I killed the corresponding processes on each node, the process info was as follows:
**gn2:** the node running launch.py
![](https://github.com/XiaotaoChen/sources/blob/master/gn2-ps.png)
![](https://github.com/XiaotaoChen/sources/blob/master/gn2-smi.png)
**gn3**
![](https://github.com/XiaotaoChen/sources/blob/master/gn3-ps.png)
![](https://github.com/XiaotaoChen/sources/blob/master/gn3-smi.png)
**gn4 and gn5 are the same as gn3**
**These processes are still blocked waiting: GPU utilization is 0%.**
### Suggestion
The root of this bug is that **io.ImageRecordIter** does not split the whole dataset evenly, which causes workers to run different numbers of iterations per epoch. Presumably **io.ImageRecordIter** favors performance and efficiency here.
The easiest fix:
**Assumptions:**
```
num_examples: number of examples in the whole dataset
num_workers: number of workers
num_examples_each_worker: number of examples assigned to each worker
```
**Then**
```
num_examples_each_worker = (num_examples + num_workers - 1) // num_workers
```
(ceiling division: every worker gets the same shard size, padded by at most num_workers - 1 wrapped-around examples in total)
This is the simplest way to guarantee that every worker runs the same number of iterations per epoch.
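A minimal sketch of an even split using ceiling division (helper names are hypothetical, not MXNet APIs; the tail of the last shard wraps around to the start of the dataset so all shards have equal length):

```python
import math

def even_shard_size(num_examples, num_workers):
    # Ceiling division: every worker gets exactly the same shard size.
    return math.ceil(num_examples / num_workers)

def shard_indices(num_examples, num_workers, rank):
    """Indices of this worker's shard; the tail wraps around to index 0
    so all shards have equal length and all workers run the same
    number of iterations per epoch."""
    size = even_shard_size(num_examples, num_workers)
    start = rank * size
    return [(start + i) % num_examples for i in range(size)]

# Toy check: 10 examples, 4 workers -> every shard has 3 indices
sizes = [len(shard_indices(10, 4, r)) for r in range(4)]  # [3, 3, 3, 3]
```

This mirrors the common "pad by wrapping" approach used by distributed samplers in other frameworks: a few examples are seen twice per epoch, but the epoch boundaries stay aligned across workers.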
### Another drawback
When I set `disp_batches=20` and the log does not print `Rank`/`TotalIter` info, the output is interleaved out of order. This can mislead people into thinking that **dist_sync** mode is not strictly synchronized. For example:
```
INFO:root:Epoch[55] Batch [80] Speed: 138.85 samples/sec accuracy=0.674023
INFO:root:Epoch[55] Batch [20] Speed: 191.34 samples/sec accuracy=0.670945
INFO:root:Epoch[55] Batch [100] Speed: 190.75 samples/sec accuracy=0.670312
INFO:root:Epoch[55] Batch [40] Speed: 190.62 samples/sec accuracy=0.664258
INFO:root:Epoch[55] Batch [40] Speed: 190.42 samples/sec accuracy=0.669531
INFO:root:Epoch[55] Batch [80] Speed: 190.35 samples/sec accuracy=0.655273
INFO:root:Epoch[55] Batch [60] Speed: 190.31 samples/sec accuracy=0.661133
INFO:root:Epoch[55] Batch [120] Speed: 190.26 samples/sec accuracy=0.668945
INFO:root:Epoch[55] Batch [100] Speed: 190.23 samples/sec accuracy=0.661523
INFO:root:Epoch[55] Batch [40] Speed: 190.10 samples/sec accuracy=0.671289
INFO:root:Epoch[55] Batch [120] Speed: 189.92 samples/sec accuracy=0.670117
INFO:root:Epoch[55] Batch [60] Speed: 189.77 samples/sec accuracy=0.668164
INFO:root:Epoch[55] Batch [60] Speed: 189.87 samples/sec accuracy=0.666406
INFO:root:Epoch[55] Batch [80] Speed: 189.79 samples/sec accuracy=0.651758
INFO:root:Epoch[55] Batch [100] Speed: 189.63 samples/sec accuracy=0.676758
INFO:root:Epoch[55] Batch [140] Speed: 189.74 samples/sec accuracy=0.677344
INFO:root:Epoch[55] Batch [120] Speed: 189.76 samples/sec accuracy=0.664258
INFO:root:Epoch[55] Batch [60] Speed: 189.84 samples/sec accuracy=0.678516
...............................
INFO:root:Epoch[119] Batch [220] Speed: 189.30 samples/sec accuracy=0.692578
INFO:root:Epoch[119] Batch [260] Speed: 189.36 samples/sec accuracy=0.697461
INFO:root:Epoch[119] Batch [140] Speed: 189.32 samples/sec accuracy=0.705664
INFO:root:Epoch[119] Batch [180] Speed: 189.32 samples/sec accuracy=0.696484
INFO:root:Epoch[119] Batch [160] Speed: 189.26 samples/sec accuracy=0.708984
INFO:root:Epoch[119] Batch [320] Speed: 189.30 samples/sec accuracy=0.704297
INFO:root:Epoch[119] Batch [300] Speed: 189.24 samples/sec accuracy=0.715625
INFO:root:Epoch[119] Batch [140] Speed: 189.33 samples/sec accuracy=0.692969
INFO:root:Epoch[119] Batch [240] Speed: 189.33 samples/sec accuracy=0.702148
INFO:root:Epoch[119] Batch [280] Speed: 189.26 samples/sec accuracy=0.688477
INFO:root:Epoch[119] Batch [160] Speed: 189.38 samples/sec accuracy=0.708789
INFO:root:Epoch[119] Batch [200] Speed: 189.36 samples/sec accuracy=0.706641
INFO:root:Epoch[119] Batch [180] Speed: 189.39 samples/sec accuracy=0.693750
INFO:root:Epoch[119] Batch [320] Speed: 189.39 samples/sec accuracy=0.695898
INFO:root:Epoch[119] Batch [340] Speed: 189.38 samples/sec accuracy=0.702930
INFO:root:Epoch[119] Batch [160] Speed: 189.41 samples/sec accuracy=0.694727
INFO:root:Epoch[119] Batch [260] Speed: 189.39 samples/sec accuracy=0.710156
INFO:root:Epoch[119] Batch [300] Speed: 189.40 samples/sec accuracy=0.709961
INFO:root:Epoch[119] Batch [180] Speed: 189.43 samples/sec accuracy=0.693555
```
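Until the example script prints rank information by default, a rank-aware batch-end callback can disambiguate the interleaved lines. A minimal sketch (RankSpeedometer is hypothetical, loosely modeled on mx.callback.Speedometer; `param` is assumed to carry MXNet's BatchEndParam fields `epoch` and `nbatch`):

```python
import logging
import time

class RankSpeedometer:
    """Batch-end callback that prefixes log lines with Rank/TotalIter,
    so interleaved dist_sync logs can be attributed to a worker."""

    def __init__(self, rank, batch_size, frequent=20):
        self.rank = rank            # e.g. kv.rank in a dist_sync job
        self.batch_size = batch_size
        self.frequent = frequent    # log every `frequent` batches
        self.total_iter = 0         # global iteration counter across epochs
        self.tic = time.time()

    def __call__(self, param):
        self.total_iter += 1
        if self.total_iter % self.frequent == 0:
            speed = self.frequent * self.batch_size / (time.time() - self.tic)
            logging.info(
                "Epoch[%d] Rank[%d] Batch[%d] TotalIter[%d] Speed: %.2f samples/sec",
                param.epoch, self.rank, param.nbatch, self.total_iter, speed)
            self.tic = time.time()
```

It would be passed as the `batch_end_callback` when fitting the model, with the rank taken from the kvstore, so every line carries both the worker identity and a monotonically increasing TotalIter.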