Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/09 22:49:13 UTC
[GitHub] [incubator-mxnet] ArmageddonKnight commented on issue #14383:
MXNET_BACKWARD_DO_MIRROR is broken
URL: https://github.com/apache/incubator-mxnet/issues/14383#issuecomment-509838577
@antinucleon Given below are the outputs that I got from the LSTM-based NMT model (from the *Sockeye* toolkit):
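For reference, mirroring is controlled by an environment variable that the MXNet backend reads at executor-bind time, so setting it before importing MXNet is the safe approach. A minimal sketch (the training entry point is whatever script you normally run, e.g. Sockeye's trainer):

```python
import os

# MXNET_BACKWARD_DO_MIRROR is read by the backend when the executor is
# bound, so set it before MXNet is imported to be safe.
os.environ["MXNET_BACKWARD_DO_MIRROR"] = "1"

# import mxnet as mx  # the actual training script would follow here

print(os.environ["MXNET_BACKWARD_DO_MIRROR"])  # confirms the flag is set
```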
### Baseline
```
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 261.10 samples/sec perplexity=921.880075 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=78.96,1935.41,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 285.67 samples/sec perplexity=616.976020 Memory Usage (MB): pid_21652=6179, PE Usage (W, J): dev_0=82.94,3793.57,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 294.01 samples/sec perplexity=496.731365 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=95.77,5878.26,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 310.77 samples/sec perplexity=424.378748 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=77.22,7468.53,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 282.17 samples/sec perplexity=369.385264 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=87.60,9455.43,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 294.64 samples/sec perplexity=321.364135 Memory Usage (MB): pid_21652=6601, PE Usage (W, J): dev_0=82.16,11240.07,
```
### `BACKWARD_DO_MIRROR=1`
```
[INFO:root] Global Step[50] Epoch[0] Batch [50] Speed: 151.09 samples/sec perplexity=949.961463 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=84.69,3587.42,
[INFO:root] Global Step[100] Epoch[0] Batch [100] Speed: 170.88 samples/sec perplexity=625.173421 Memory Usage (MB): pid_20928=2425, PE Usage (W, J): dev_0=76.74,6461.51,
[INFO:root] Global Step[150] Epoch[0] Batch [150] Speed: 178.00 samples/sec perplexity=499.439886 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=84.37,9494.95,
[INFO:root] Global Step[200] Epoch[0] Batch [200] Speed: 195.40 samples/sec perplexity=426.799941 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=79.16,12087.66,
[INFO:root] Global Step[250] Epoch[0] Batch [250] Speed: 169.05 samples/sec perplexity=371.365061 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=81.68,15179.92,
[INFO:root] Global Step[300] Epoch[0] Batch [300] Speed: 180.27 samples/sec perplexity=323.268620 Memory Usage (MB): pid_20928=2475, PE Usage (W, J): dev_0=73.94,17805.00,
```
We can see from the output logs that, for the same number of global steps, both runs reach roughly the same training quality (perplexity). The memory footprint with backward mirroring is around **`1/3`** of the baseline's (~2475 MB vs. ~6601 MB), but this comes with a performance drop of around **`40%`** (~180 vs. ~294 samples/sec).
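These ratios can be verified directly from the last logged steps above (a quick sanity check using the Global Step[300] values):

```python
# Last logged values from the two runs above.
baseline_mem, mirror_mem = 6601, 2475      # memory usage, MB
baseline_spd, mirror_spd = 294.64, 180.27  # samples/sec at Global Step[300]

mem_ratio = mirror_mem / baseline_mem       # ~0.37, i.e. roughly 1/3
speed_drop = 1 - mirror_spd / baseline_spd  # ~0.39, i.e. roughly 40%

print(f"memory ratio:    {mem_ratio:.2f}")
print(f"throughput drop: {speed_drop:.1%}")
```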
I am still investigating the cause of this large performance drop. In the meantime, if you have any specific benchmark of interest (preferably a small one, like the one you commented above, since we are still in the debugging phase), please let me know. Thanks.