Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/05/11 04:55:39 UTC
[GitHub] [incubator-mxnet] ZhennanQin opened a new pull request #14931: Re-enable static cached_op optimization
URL: https://github.com/apache/incubator-mxnet/pull/14931
## Description ##
This PR was first filed as https://github.com/apache/incubator-mxnet/pull/14785, but it was reverted because of https://github.com/dmlc/gluon-nlp/issues/690. That problem is now fixed.
With the fix, the output is:
```
$ python run_pretraining.py --gpus 0 --batch_size 32 --lr 2e-5 --data 'out/*.npz' --warmup_ratio 0.5 --num_steps 20 --pretrained --log_interval=2 --data_eval 'out/*.npz' --batch_size_eval 8 --ckpt_dir ckpt
INFO:root:Namespace(accumulate=1, batch_size=32, batch_size_eval=8, ckpt_dir='ckpt', ckpt_interval=250000, data='out/*.npz', data_eval='out/*.npz', dataset_name='book_corpus_wiki_en_uncased', dtype='float32', dummy_data_len=None, gpus='0', kvstore='device', log_interval=2, lr=2e-05, model='bert_12_768_12', num_buckets=1, num_steps=20, pretrained=True, profile=None, seed=0, start_step=0, use_avg_len=False, verbose=False, warmup_ratio=0.5)
[12:41:47] src/storage/storage.cc:108: Using GPUPooledRoundedStorageManager.
INFO:root:Using training data at out/*.npz
INFO:root:[step 1] mlm_loss=1.57931 mlm_acc=46.99793 nsp_loss=0.32225 nsp_acc=84.375 throughput=2.4K tks/s lr=0.0000020 time=1.34, latency=668.3 ms/batch
INFO:root:[step 3] mlm_loss=3.27771 mlm_acc=48.94737 nsp_loss=0.47513 nsp_acc=88.333 throughput=6.8K tks/s lr=0.0000060 time=0.93, latency=466.5 ms/batch
INFO:root:[step 5] mlm_loss=2.79173 mlm_acc=50.60241 nsp_loss=0.12542 nsp_acc=92.857 throughput=6.2K tks/s lr=0.0000100 time=0.91, latency=453.5 ms/batch
INFO:root:[step 7] mlm_loss=2.81245 mlm_acc=52.21421 nsp_loss=0.25357 nsp_acc=92.188 throughput=6.5K tks/s lr=0.0000140 time=1.00, latency=499.9 ms/batch
INFO:root:[step 9] mlm_loss=2.09288 mlm_acc=62.17143 nsp_loss=0.01765 nsp_acc=100.000 throughput=6.3K tks/s lr=0.0000180 time=0.93, latency=464.5 ms/batch
INFO:root:[step 11] mlm_loss=1.53132 mlm_acc=71.63121 nsp_loss=0.00595 nsp_acc=100.000 throughput=6.8K tks/s lr=0.0000090 time=0.98, latency=490.6 ms/batch
INFO:root:[step 13] mlm_loss=1.04932 mlm_acc=79.85866 nsp_loss=0.00190 nsp_acc=100.000 throughput=5.9K tks/s lr=0.0000070 time=0.97, latency=484.6 ms/batch
INFO:root:[step 15] mlm_loss=0.93987 mlm_acc=81.75027 nsp_loss=0.00284 nsp_acc=100.000 throughput=6.3K tks/s lr=0.0000050 time=1.00, latency=500.4 ms/batch
INFO:root:[step 17] mlm_loss=0.68529 mlm_acc=86.50794 nsp_loss=0.00250 nsp_acc=100.000 throughput=6.6K tks/s lr=0.0000030 time=1.03, latency=516.9 ms/batch
INFO:root:[step 19] mlm_loss=0.50951 mlm_acc=89.44900 nsp_loss=0.00218 nsp_acc=100.000 throughput=6.2K tks/s lr=0.0000010 time=0.92, latency=460.7 ms/batch
INFO:root:[step 20] Saving checkpoints to ckpt/0000020.params, ckpt/0000020.states.
INFO:root:Train cost=26.3s
INFO:root:Using evaluation data at out/*.npz
INFO:root:[step 1] mlm_loss=0.20752 mlm_acc=93.26923 nsp_loss=0.00008 nsp_acc=100.000 throughput=3.0K tks/s lr=0.0000000 time=0.23, latency=115.1 ms/batch
INFO:root:[step 3] mlm_loss=0.43117 mlm_acc=91.76955 nsp_loss=0.00139 nsp_acc=100.000 throughput=10.5K tks/s lr=0.0000000 time=0.15, latency=77.5 ms/batch
INFO:root:[step 5] mlm_loss=0.39373 mlm_acc=92.27941 nsp_loss=0.00016 nsp_acc=100.000 throughput=12.0K tks/s lr=0.0000000 time=0.15, latency=76.4 ms/batch
INFO:root:[step 7] mlm_loss=0.37862 mlm_acc=91.18943 nsp_loss=0.00093 nsp_acc=100.000 throughput=10.4K tks/s lr=0.0000000 time=0.15, latency=73.4 ms/batch
INFO:root:mlm_loss=0.406 mlm_acc=92.0 nsp_loss=0.001 nsp_acc=100.0
INFO:root:Eval cost=0.7s
```
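For anyone who wants to compare runs before and after this change, the per-step lines in the log above can be parsed mechanically. The following is only a minimal sketch: the helper name `parse_step_line` and both regular expressions are illustrative assumptions based on the log format shown here, not part of MXNet or gluon-nlp.

```python
import re

# Assumed format, taken from the log above:
# "INFO:root:[step 1] mlm_loss=1.57931 ... throughput=2.4K tks/s ... latency=668.3 ms/batch"
STEP_RE = re.compile(r"\[step (\d+)\]\s+(.*)")
# key=number, with an optional "K" suffix (seen on the throughput field)
FIELD_RE = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?)(K?)")


def parse_step_line(line):
    """Return (step, {field: float}) for a per-step metrics line, else None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    fields = {}
    for key, number, suffix in FIELD_RE.findall(match.group(2)):
        value = float(number)
        if suffix == "K":
            value *= 1000.0  # "2.4K tks/s" is read as 2400 tokens per second
        fields[key] = value
    if "mlm_loss" not in fields:
        return None  # for example, the "Saving checkpoints" line
    return int(match.group(1)), fields
```

Applying the helper to each line of a saved log yields the step number together with values such as `mlm_loss` and `latency`, which can then be compared between two runs.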
@pengzhao-intel @eric-haibin-lin
## Checklist ##
### Essentials ###
Please feel free to remove inapplicable items for your PR.
- [ ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) created (except PRs with tiny changes)
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage:
  - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
- [ ] Code is well-documented:
  - For user-facing API changes, the API doc string has been updated.
  - For new C++ functions in header files, their functionalities and arguments are documented.
  - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  - Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
- [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
### Changes ###
- [ ] Feature1, tests, (and when applicable, API doc)
- [ ] Feature2, tests, (and when applicable, API doc)
## Comments ##
- If this change is a backward incompatible change, why must this change be made.
- Interesting edge cases to note here