Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/09 22:49:13 UTC

[GitHub] [incubator-mxnet] ArmageddonKnight commented on issue #14383: MXNET_BACKWARD_DO_MIRROR is broken

URL: https://github.com/apache/incubator-mxnet/issues/14383#issuecomment-509838577
 
 
   @antinucleon Below are the outputs that I got from the LSTM-based NMT model (from the *Sockeye* toolkit):
   
   ### Baseline 
   
   ```
   [INFO:root] Global Step[50] Epoch[0] Batch [50]	Speed: 261.10 samples/sec	perplexity=921.880075	Memory Usage (MB): pid_21652=6179, 	PE Usage (W, J): dev_0=78.96,1935.41, 
   [INFO:root] Global Step[100] Epoch[0] Batch [100]	Speed: 285.67 samples/sec	perplexity=616.976020	Memory Usage (MB): pid_21652=6179, 	PE Usage (W, J): dev_0=82.94,3793.57, 
   [INFO:root] Global Step[150] Epoch[0] Batch [150]	Speed: 294.01 samples/sec	perplexity=496.731365	Memory Usage (MB): pid_21652=6601, 	PE Usage (W, J): dev_0=95.77,5878.26, 
   [INFO:root] Global Step[200] Epoch[0] Batch [200]	Speed: 310.77 samples/sec	perplexity=424.378748	Memory Usage (MB): pid_21652=6601, 	PE Usage (W, J): dev_0=77.22,7468.53, 
   [INFO:root] Global Step[250] Epoch[0] Batch [250]	Speed: 282.17 samples/sec	perplexity=369.385264	Memory Usage (MB): pid_21652=6601, 	PE Usage (W, J): dev_0=87.60,9455.43, 
   [INFO:root] Global Step[300] Epoch[0] Batch [300]	Speed: 294.64 samples/sec	perplexity=321.364135	Memory Usage (MB): pid_21652=6601, 	PE Usage (W, J): dev_0=82.16,11240.07, 
   ```
   
   ### `BACKWARD_DO_MIRROR=1`
   
   ```
   [INFO:root] Global Step[50] Epoch[0] Batch [50]	Speed: 151.09 samples/sec	perplexity=949.961463	Memory Usage (MB): pid_20928=2425, 	PE Usage (W, J): dev_0=84.69,3587.42, 
   [INFO:root] Global Step[100] Epoch[0] Batch [100]	Speed: 170.88 samples/sec	perplexity=625.173421	Memory Usage (MB): pid_20928=2425, 	PE Usage (W, J): dev_0=76.74,6461.51, 
   [INFO:root] Global Step[150] Epoch[0] Batch [150]	Speed: 178.00 samples/sec	perplexity=499.439886	Memory Usage (MB): pid_20928=2475, 	PE Usage (W, J): dev_0=84.37,9494.95, 
   [INFO:root] Global Step[200] Epoch[0] Batch [200]	Speed: 195.40 samples/sec	perplexity=426.799941	Memory Usage (MB): pid_20928=2475, 	PE Usage (W, J): dev_0=79.16,12087.66, 
   [INFO:root] Global Step[250] Epoch[0] Batch [250]	Speed: 169.05 samples/sec	perplexity=371.365061	Memory Usage (MB): pid_20928=2475, 	PE Usage (W, J): dev_0=81.68,15179.92, 
   [INFO:root] Global Step[300] Epoch[0] Batch [300]	Speed: 180.27 samples/sec	perplexity=323.268620	Memory Usage (MB): pid_20928=2475, 	PE Usage (W, J): dev_0=73.94,17805.00, 
   ```
   
   We can see from the output logs that, with the same number of global steps, both runs reach roughly the same training quality. The memory footprint with backward mirroring is around **`1/3`** of the baseline's, but this comes with a roughly **`40%`** drop in training throughput.
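   
   For concreteness, these figures come from comparing the step-300 lines of the two logs above:
   
   ```python
   # Quick arithmetic from the step-300 log lines.
   baseline_mem, mirror_mem = 6601, 2475      # Memory Usage (MB)
   baseline_spd, mirror_spd = 294.64, 180.27  # Speed (samples/sec)

   print(f'memory ratio:    {mirror_mem / baseline_mem:.2f}')      # ~0.37
   print(f'throughput drop: {1 - mirror_spd / baseline_spd:.0%}')  # ~39%
   ```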
   
   I am still investigating the cause of this large performance drop. In the meantime, if you have any specific benchmark of interest (preferably a small one, like the one you mentioned above, since we are still in the debugging phase), please kindly let me know. Thanks.
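   
   In case it helps with reproducing, here is a minimal sketch of how I toggle the flag; the tiny MLP and input shape below are just placeholders, not the actual Sockeye model:
   
   ```python
   import os
   # As I understand it, the flag has to be set before the executor is
   # bound, since the memory planner reads it at bind time to decide which
   # activations to recompute in the backward pass instead of storing them.
   os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'

   import mxnet as mx

   # Toy stand-in network (placeholder for Sockeye's LSTM NMT model).
   data = mx.sym.Variable('data')
   net = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
   net = mx.sym.Activation(net, act_type='relu', name='relu1')
   net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
   net = mx.sym.SoftmaxOutput(net, name='softmax')

   # simple_bind plans memory for forward + backward; with mirroring on,
   # cheap-to-recompute outputs (e.g. activations) are dropped after the
   # forward pass and recomputed during backward, lowering peak memory
   # at the cost of extra forward compute.
   exe = net.simple_bind(mx.cpu(), data=(32, 256), grad_req='write')
   ```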
