Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/29 01:58:37 UTC

[GitHub] XiaotaoChen opened a new issue #9611: program can't finish normally in dist_sync mode

URL: https://github.com/apache/incubator-mxnet/issues/9611
 
 
   ### Description
   
   I train ResNet-50 on 4 nodes using the example/image-classification/train_imagenet.py script. At the beginning it runs normally, but in the last epoch, **the worker with the fewest iterations finishes its final iteration, call it n, while the other workers are still waiting for the update of iteration n+1.**
   
   **The difference in the total number of iterations between workers causes this bug.**
   
   ### Configuration
   
   **The launch command below was run on gn2:**
   
   ```
   python -u /THL5/home/daodao/cxt/mxnet0.11/tools/launch.py -n 4 --launcher ssh -H hosts-4 "python -u /THL5/home/daodao/cxt/mxnet0.11/example/image-classification/train_imagenet.py --gpus 0,1,2,3 --kv-store dist_sync "
   ```
   
   **Contents of hosts-4:**
   
   ```
   gn2
   gn3
   gn4
   gn5
   ```
   
   **Training configuration:**
   
   ```
   num_examples:120000
   num_epochs:80
   ```
   
   **The detailed training arguments:**
   
   ```
   INFO:root:start with arguments Namespace(batch_size=256, benchmark=0, data_nthreads=4, data_train='/THL5/home/daodao/imagenet/mxnet_cxt/train1200000.rec', data_val='/THL5/home/daodao/imagenet/mxnet_cxt/val40000.rec', disp_batches=1, dtype='float32', gpus='0,1,2,3', image_shape='3,224,224', kv_store='dist_sync', load_epoch=None, lr=0.4, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=120000, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
   ```
   
   ### Analysis
   
   **Root cause:**
   
   The dataset is partitioned across workers by `mx.io.ImageRecordIter`, and the partitions do not contain the same number of examples. As a result, the number of iterations per epoch differs between workers, and the total number of iterations over the whole training run differs as well.
   
   Suppose worker1 is the worker with the smallest total number of iterations, and its last iteration is n. When worker1 finishes iteration n, the other workers still need to train on iterations n+1, n+2, ... To apply the update of iteration n+1, the kvstore must receive gradients from all workers, but worker1 has already finished and will never push again. The kvstore therefore keeps waiting for worker1's push, and the program cannot finish normally.
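   
   To make the mechanism concrete, here is a minimal numeric sketch (the shard sizes are hypothetical, not taken from MXNet's actual partitioning code) of how a slightly uneven split turns into different per-epoch batch counts:
   
   ```python
   # Hypothetical shard sizes for a 1,200,000-example .rec file split 4 ways.
   # mx.io.ImageRecordIter's real split may differ, but an imbalance of even a
   # few hundred examples produces the pattern seen in the table below.
   num_workers = 4
   batch_size = 256                                  # per-worker batch size
   shard_sizes = [299776, 300288, 300032, 299904]    # sums to 1,200,000
   
   for rank, size in enumerate(shard_sizes):
       batches_per_epoch = (size + batch_size - 1) // batch_size  # ceil division
       print(f"worker{rank}: {size} examples -> {batches_per_epoch} batches/epoch")
   # -> 1171, 1173, 1172, 1172 batches. Over 80 epochs the totals diverge, and
   #    the worker with the smallest total exits first, deadlocking the rest.
   ```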
   
   ### Evidence
   
   **Number of iterations per epoch for each worker:**
   
   | epoch | worker0 | worker1 | worker2 | worker3 |
   | ----- | ------- | ------- | ------- | ------- |
   | 1     | 1172    | 1173    | 1172    | 1172    |
   | 2     | 1172    | 1172    | 1172    | 1172    |
   | 3     | 1171    | 1172    | 1172    | 1172    |
   | 4     | 1172    | 1172    | 1172    | 1172    |
   | 5     | 1171    | 1172    | 1172    | 1172    |
   | 6     | 1172    | 1172    | 1172    | 1171    |
   | 7     | 1172    | 1173    | 1172    | 1172    |
   | 8     | 1171    | 1172    | 1172    | 1172    |
   | 9     | 1172    | 1172    | 1172    | 1172    |
   | 10    | 1171    | 1172    | 1172    | 1172    |
   | 11    | 1172    | 1172    | 1172    | 1171    |
   | 12    | 1171    | 1172    | 1172    | 1172    |
   | ...   | ...     | ...     | ...     | ...     |
   
   **Log info:**
   
   ```
   INFO:root:Epoch[0] Rank[3] Batch[1171] TotalIter[1171] Speed: 187.74 samples/sec   
   INFO:root:Epoch[0] Train-accuracy=nan
   INFO:root:Epoch[0] Time cost=1585.830
   INFO:root:Epoch[0] Rank[0] Batch[1171] TotalIter[1171] Speed: 187.42 samples/sec
   INFO:root:Epoch[0] Train-accuracy=nan
   INFO:root:Epoch[0] Time cost=1585.830
   INFO:root:Epoch[0] Rank[2] Batch[1171] TotalIter[1171] Speed: 188.36 samples/sec
   INFO:root:Epoch[0] Train-accuracy=nan
   INFO:root:Epoch[0] Time cost=1585.818
   INFO:root:Epoch[0] Rank[1] Batch[1172] TotalIter[1172] Speed: 189.53 samples/sec
   INFO:root:Epoch[0] Train-accuracy=nan
   INFO:root:Epoch[0] Time cost=1587.169
   INFO:root:Epoch[1] Rank[3] Batch[1] TotalIter[1173] Speed: 189.76 samples/sec   
   INFO:root:Epoch[1] Rank[0] Batch[1] TotalIter[1173] Speed: 187.95 samples/sec
   INFO:root:Epoch[1] Rank[2] Batch[1] TotalIter[1173] Speed: 189.15 samples/sec   
   INFO:root:Epoch[1] Rank[3] Batch[2] TotalIter[1174] Speed: 188.17 samples/sec   
   INFO:root:Epoch[1] Rank[1] Batch[1] TotalIter[1174] Speed: 189.03 samples/sec   
   INFO:root:Epoch[1] Rank[0] Batch[2] TotalIter[1174] Speed: 189.08 samples/sec   
   INFO:root:Epoch[1] Rank[2] Batch[2] TotalIter[1174] Speed: 188.24 samples/sec   
   ......
   INFO:root:Epoch[79] Rank[3] Batch[1151] TotalIter[93724] Speed: 189.38 samples/sec  
   INFO:root:Epoch[79] Rank[1] Batch[1123] TotalIter[93724] Speed: 189.85 samples/sec
   INFO:root:Epoch[79] Rank[2] Batch[1139] TotalIter[93724] Speed: 189.56 samples/sec  
   INFO:root:Epoch[79] Rank[0] Batch[1169] TotalIter[93724] Speed: 189.05 samples/sec   
   INFO:root:Epoch[79] Rank[1] Batch[1124] TotalIter[93725] Speed: 189.96 samples/sec  
   INFO:root:Epoch[79] Rank[3] Batch[1152] TotalIter[93725] Speed: 188.91 samples/sec 
   INFO:root:Epoch[79] Rank[2] Batch[1140] TotalIter[93725] Speed: 190.17 samples/sec   
   INFO:root:Epoch[79] Rank[0] Batch[1170] TotalIter[93725] Speed: 190.06 samples/sec  
   INFO:root:Epoch[79] Train-accuracy=nan
   INFO:root:Epoch[79] Time cost=1582.509
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   Killed by signal 15.
   
   ```
   
   According to the log and the table above:
   
   1. At the end of epoch 79, worker0 finished its 1171st update (`Batch[1170]`, counting from 0) and then printed the epoch summary.
   2. The table shows the other workers had not finished this epoch; they were still blocked waiting for the next synchronous update.
   3. The TotalIter field shows each worker reached iteration 93725 over the whole training run, and every iteration is strictly synchronous across workers. As a sanity check, 93725 / 80 ≈ 1171.6 iterations per epoch, which matches the per-epoch counts of 1171-1173 in the table.
   
   **Process (ps) info and GPU usage:**
   
   Before I killed the remaining processes on each node, the process info was as follows:
   
   **gn2** (the node running launch.py):
   
   ![](https://github.com/XiaotaoChen/sources/blob/master/gn2-ps.png)
   
   ![](https://github.com/XiaotaoChen/sources/blob/master/gn2-smi.png)
   
   **gn3**
   
   ![](https://github.com/XiaotaoChen/sources/blob/master/gn3-ps.png)
   
   ![](https://github.com/XiaotaoChen/sources/blob/master/gn3-smi.png)
   
   **gn4 and gn5 are the same as gn3**
   
   **These processes are still stuck waiting: GPU utilization is 0%.**
   
   ### Suggestion
   
   The root of this bug is that **io.ImageRecordIter** does not split the whole dataset evenly, which causes the workers to run different numbers of iterations per epoch. Presumably **io.ImageRecordIter** is designed this way for performance and efficiency.
   
   The easiest way is:
   
   **Assumptions:**
   
   ```
   num_examples: the number of examples in the whole dataset
   num_workers: the number of workers
   num_examples_each_worker: the number of examples assigned to each worker
   ```
   
   **Then** 
   
   ```
   num_examples_each_worker = num_examples//num_workers + num_examples%num_workers
   ```
   
   This is the simplest way to guarantee that every worker runs the same number of iterations per epoch. Note that when num_examples is not divisible by num_workers, each worker's shard is padded, so a few examples are visited more than once per epoch; see the sketch below.
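   
   As a minimal sketch (illustrative Python only; a real fix would live inside the record-iterator implementation), the padded split could look like this, where `padded_shard` is a hypothetical helper:
   
   ```python
   def padded_shard(example_indices, num_workers, rank):
       """Give every worker the same shard size, following the formula above:
       floor division plus the remainder, padding by wrapping around the
       dataset so per-epoch iteration counts match on all workers."""
       num_examples = len(example_indices)
       shard_size = num_examples // num_workers + num_examples % num_workers
       start = rank * (num_examples // num_workers)
       # Hypothetical padding rule: wrap around the dataset as needed.
       return [example_indices[(start + i) % num_examples] for i in range(shard_size)]
   
   # Every worker now gets the same number of examples:
   indices = list(range(10))
   for rank in range(4):
       shard = padded_shard(indices, num_workers=4, rank=rank)
       print(f"worker{rank}: {len(shard)} examples -> {shard}")
   # All four shards have length 10//4 + 10%4 = 4, so with any fixed batch
   # size every worker runs the same number of iterations per epoch.
   ```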
   
   ### Another drawback
   
   When I set `disp_batches=20` without printing the `Rank`/`TotalIter` info, the output is out of order. This can mislead people into thinking that **dist_sync** mode is not strictly synchronized. The logs are below, followed by a sketch of rank-aware logging:
   
   ```
   INFO:root:Epoch[55] Batch [80]	Speed: 138.85 samples/sec	accuracy=0.674023
   INFO:root:Epoch[55] Batch [20]	Speed: 191.34 samples/sec	accuracy=0.670945
   INFO:root:Epoch[55] Batch [100]	Speed: 190.75 samples/sec	accuracy=0.670312
   INFO:root:Epoch[55] Batch [40]	Speed: 190.62 samples/sec	accuracy=0.664258
   INFO:root:Epoch[55] Batch [40]	Speed: 190.42 samples/sec	accuracy=0.669531
   INFO:root:Epoch[55] Batch [80]	Speed: 190.35 samples/sec	accuracy=0.655273
   INFO:root:Epoch[55] Batch [60]	Speed: 190.31 samples/sec	accuracy=0.661133
   INFO:root:Epoch[55] Batch [120]	Speed: 190.26 samples/sec	accuracy=0.668945
   INFO:root:Epoch[55] Batch [100]	Speed: 190.23 samples/sec	accuracy=0.661523
   INFO:root:Epoch[55] Batch [40]	Speed: 190.10 samples/sec	accuracy=0.671289
   INFO:root:Epoch[55] Batch [120]	Speed: 189.92 samples/sec	accuracy=0.670117
   INFO:root:Epoch[55] Batch [60]	Speed: 189.77 samples/sec	accuracy=0.668164
   INFO:root:Epoch[55] Batch [60]	Speed: 189.87 samples/sec	accuracy=0.666406
   INFO:root:Epoch[55] Batch [80]	Speed: 189.79 samples/sec	accuracy=0.651758
   INFO:root:Epoch[55] Batch [100]	Speed: 189.63 samples/sec	accuracy=0.676758
   INFO:root:Epoch[55] Batch [140]	Speed: 189.74 samples/sec	accuracy=0.677344
   INFO:root:Epoch[55] Batch [120]	Speed: 189.76 samples/sec	accuracy=0.664258
   INFO:root:Epoch[55] Batch [60]	Speed: 189.84 samples/sec	accuracy=0.678516
   ...............................
   INFO:root:Epoch[119] Batch [220]	Speed: 189.30 samples/sec	accuracy=0.692578
   INFO:root:Epoch[119] Batch [260]	Speed: 189.36 samples/sec	accuracy=0.697461
   INFO:root:Epoch[119] Batch [140]	Speed: 189.32 samples/sec	accuracy=0.705664
   INFO:root:Epoch[119] Batch [180]	Speed: 189.32 samples/sec	accuracy=0.696484
   INFO:root:Epoch[119] Batch [160]	Speed: 189.26 samples/sec	accuracy=0.708984
   INFO:root:Epoch[119] Batch [320]	Speed: 189.30 samples/sec	accuracy=0.704297
   INFO:root:Epoch[119] Batch [300]	Speed: 189.24 samples/sec	accuracy=0.715625
   INFO:root:Epoch[119] Batch [140]	Speed: 189.33 samples/sec	accuracy=0.692969
   INFO:root:Epoch[119] Batch [240]	Speed: 189.33 samples/sec	accuracy=0.702148
   INFO:root:Epoch[119] Batch [280]	Speed: 189.26 samples/sec	accuracy=0.688477
   INFO:root:Epoch[119] Batch [160]	Speed: 189.38 samples/sec	accuracy=0.708789
   INFO:root:Epoch[119] Batch [200]	Speed: 189.36 samples/sec	accuracy=0.706641
   INFO:root:Epoch[119] Batch [180]	Speed: 189.39 samples/sec	accuracy=0.693750
   INFO:root:Epoch[119] Batch [320]	Speed: 189.39 samples/sec	accuracy=0.695898
   INFO:root:Epoch[119] Batch [340]	Speed: 189.38 samples/sec	accuracy=0.702930
   INFO:root:Epoch[119] Batch [160]	Speed: 189.41 samples/sec	accuracy=0.694727
   INFO:root:Epoch[119] Batch [260]	Speed: 189.39 samples/sec	accuracy=0.710156
   INFO:root:Epoch[119] Batch [300]	Speed: 189.40 samples/sec	accuracy=0.709961
   INFO:root:Epoch[119] Batch [180]	Speed: 189.43 samples/sec	accuracy=0.693555
   ```
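   
   One way to make such logs readable is to tag every line with the worker rank, as the `Rank[...]` logs earlier in this issue do. The sketch below is hypothetical wiring in the spirit of `mx.callback.Speedometer`, not code from the repo; it assumes the kvstore object is available where the callback is created:
   
   ```python
   import logging
   import time
   
   def rank_aware_speedometer(kv, batch_size, frequent=20):
       """Batch-end callback that prefixes each log line with the worker rank,
       so interleaved output from dist_sync workers can be told apart."""
       state = {"tic": time.time(), "count": 0}
   
       def callback(param):  # param is a mx.model.BatchEndParam
           state["count"] += 1
           if state["count"] % frequent == 0:
               speed = frequent * batch_size / (time.time() - state["tic"])
               metrics = ", ".join("%s=%f" % nv
                                   for nv in param.eval_metric.get_name_value())
               logging.info("Epoch[%d] Rank[%d] Batch[%d] Speed: %.2f samples/sec %s",
                            param.epoch, kv.rank, param.nbatch, speed, metrics)
               state["tic"] = time.time()
       return callback
   
   # Hypothetical usage with module.fit:
   # kv = mx.kvstore.create('dist_sync')
   # mod.fit(train_iter, num_epoch=80, kvstore=kv,
   #         batch_end_callback=rank_aware_speedometer(kv, batch_size=256))
   ```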
