Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/25 06:25:35 UTC
[GitHub] [incubator-mxnet] Neutron3529 opened a new issue #15655:
Performance regression for gluon dataloader with large batch size
URL: https://github.com/apache/incubator-mxnet/issues/15655
## Description
Gluon's `DataLoader` performs terribly compared to `mx.io.NDArrayIter` when the batch size is large.
## Environment info (Required)
```
----------Python Info----------
Version : 3.6.6
Compiler : MSC v.1900 64 bit (AMD64)
Build : ('v3.6.6:4cf1f54eb7', 'Jun 27 2018 03:37:03')
Arch : ('64bit', 'WindowsPE')
------------Pip Info-----------
Version : 19.1.1
Directory : d:\program files\python36\lib\site-packages\pip
----------MXNet Info-----------
Version : 1.4.1
Directory : d:\program files\python36\lib\site-packages\mxnet
Commit hash file "d:\program files\python36\lib\site-packages\mxnet\COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library : ['d:\\program files\\python36\\lib\\site-packages\\mxnet\\libmxnet.dll']
Build features:
No runtime build feature info available
----------System Info----------
Platform : Windows-10-10.0.17758-SP0
system : Windows
node : Neutron
release : 10
version : 10.0.17758
----------Hardware Info----------
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Name
Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0249 sec, LOAD: 2.2043 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5834 sec, LOAD: 1.2048 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0020 sec, LOAD: 0.9903 sec.
Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:841)>, DNS finished in 0.1266636848449707 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.2234 sec, LOAD: 4.6954 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.3620 sec, LOAD: 0.7789 sec.
```
## Minimum reproducible example
## Steps to reproduce
```
from time import time

import mxnet as mx

def data_xform(data):
    """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
    return mx.ndarray.moveaxis(data, 2, 0).astype('float32') / 255

train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
train_loader = mx.gluon.data.DataLoader(train_data, shuffle=False, batch_size=10000)

a = iter(train_loader)
t = time()
_ = next(a)  # fetch the first batch of 10000 samples
print(time() - t)
```
A single `next(a)` took `3.6711745262145996` seconds, so fetching all 6 batches takes roughly 20 s in total.
`mx.io.NDArrayIter`, by comparison, finishes the iteration almost immediately.
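To compare the two loaders batch by batch rather than from a single `next(a)` call, a small timing helper can be used. This is only a sketch: `time_batches` and its `max_batches` parameter are hypothetical names, not part of MXNet, and the helper works on any Python iterable.

```python
from time import perf_counter

def time_batches(loader, max_batches=None):
    """Return a list with the seconds spent fetching each batch from `loader`.

    `loader` can be any iterable; `max_batches` (hypothetical parameter)
    stops the measurement early when it is not None.
    """
    timings = []
    it = iter(loader)
    while max_batches is None or len(timings) < max_batches:
        start = perf_counter()
        try:
            next(it)  # the call being measured
        except StopIteration:
            break  # the loader is exhausted
        timings.append(perf_counter() - start)
    return timings

if __name__ == "__main__":
    # Stand-in iterable so the sketch runs without MXNet installed.
    per_batch = time_batches(range(6))
    print(len(per_batch), "batches, total", sum(per_batch), "seconds")
```

Passing `train_loader` from the example above (or an `mx.io.NDArrayIter`) in place of `range(6)` should give one timing per batch, which makes the comparison easier to reproduce.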
## What have you tried to solve it?
I traced the problem to [this line](https://github.com/apache/incubator-mxnet/blob/8158ba4b0f1ebd696ec09a0b1aa6031bacb60740/python/mxnet/gluon/data/dataloader.py#L371), but I could not fix it.
What's more, after changing `batch_size` to `10000`, the `mx.io.NDArrayIter`-based tutorial fails to train the [MNIST model](https://mxnet.incubator.apache.org/versions/master/tutorials/python/mnist.html); the training accuracy stays at about 0.11 in every epoch:
```
......
>>> batch_size = 10000
>>> train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
>>> val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
......
>>> import logging
>>> logging.getLogger().setLevel(logging.DEBUG) # logging to stdout
>>> # create a trainable module on compute context
... mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
>>> mlp_model.fit(train_iter, # train data
... eval_data=val_iter, # validation data
... optimizer='sgd', # use SGD to train
... optimizer_params={'learning_rate':0.1}, # use fixed learning rate
... eval_metric='acc', # report accuracy during training
... batch_end_callback = mx.callback.Speedometer(batch_size, 100), # output progress for each 100 data batches
... num_epoch=10) # train for at most 10 dataset passes
INFO:root:Epoch[0] Train-accuracy=0.107000
INFO:root:Epoch[0] Time cost=0.123
INFO:root:Epoch[0] Validation-accuracy=0.113500
INFO:root:Epoch[1] Train-accuracy=0.112367
INFO:root:Epoch[1] Time cost=0.183
INFO:root:Epoch[1] Validation-accuracy=0.113500
INFO:root:Epoch[2] Train-accuracy=0.112367
INFO:root:Epoch[2] Time cost=0.158
INFO:root:Epoch[2] Validation-accuracy=0.113500
INFO:root:Epoch[3] Train-accuracy=0.112367
INFO:root:Epoch[3] Time cost=0.504
INFO:root:Epoch[3] Validation-accuracy=0.113500
INFO:root:Epoch[4] Train-accuracy=0.112367
INFO:root:Epoch[4] Time cost=0.142
INFO:root:Epoch[4] Validation-accuracy=0.113500
INFO:root:Epoch[5] Train-accuracy=0.112367
INFO:root:Epoch[5] Time cost=0.164
INFO:root:Epoch[5] Validation-accuracy=0.113500
INFO:root:Epoch[6] Train-accuracy=0.112367
INFO:root:Epoch[6] Time cost=0.471
INFO:root:Epoch[6] Validation-accuracy=0.113500
INFO:root:Epoch[7] Train-accuracy=0.112367
INFO:root:Epoch[7] Time cost=0.167
INFO:root:Epoch[7] Validation-accuracy=0.113500
INFO:root:Epoch[8] Train-accuracy=0.112367
INFO:root:Epoch[8] Time cost=0.241
INFO:root:Epoch[8] Validation-accuracy=0.113500
INFO:root:Epoch[9] Train-accuracy=0.112367
INFO:root:Epoch[9] Time cost=0.307
INFO:root:Epoch[9] Validation-accuracy=0.113500
```
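For context on the log above, the arithmetic below counts how many parameter updates the run performs. It is a sketch of one possible contributor, not a confirmed diagnosis: it assumes the 60000-sample MNIST training set, one update per batch, and a batch size of 100 as the baseline for comparison (an assumption, not something stated in this report). `updates_per_run` is a hypothetical helper name.

```python
import math

def updates_per_run(num_samples, batch_size, num_epoch):
    """Return (batches per epoch, total updates), assuming one update per batch."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * num_epoch

NUM_TRAIN = 60000  # size of the MNIST training set
NUM_EPOCH = 10     # matches num_epoch=10 in the fit() call above

# 100 is an assumed baseline batch size; 10000 is the value used in this report.
for batch_size in (100, 10000):
    per_epoch, total = updates_per_run(NUM_TRAIN, batch_size, NUM_EPOCH)
    print(f"batch_size={batch_size}: {per_epoch} updates/epoch, {total} updates total")
```

With `batch_size = 10000` the model sees only 6 updates per epoch, which is also why the `DataLoader` example above needs just 6 calls to `next(a)` to cover the dataset.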