Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/25 06:25:35 UTC
[GitHub] [incubator-mxnet] Neutron3529 opened a new issue #15655: Performance regression for gluon dataloader with large batch size

Neutron3529 opened a new issue #15655: Performance regression for gluon dataloader with large batch size
URL: https://github.com/apache/incubator-mxnet/issues/15655
 
 
   ## Description
   
   Gluon's `DataLoader` performs terribly compared to `mx.io.NDArrayIter` when the batch size is large.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.6
   Compiler     : MSC v.1900 64 bit (AMD64)
   Build        : ('v3.6.6:4cf1f54eb7', 'Jun 27 2018 03:37:03')
   Arch         : ('64bit', 'WindowsPE')
   ------------Pip Info-----------
   Version      : 19.1.1
   Directory    : d:\program files\python36\lib\site-packages\pip
   ----------MXNet Info-----------
   Version      : 1.4.1
   Directory    : d:\program files\python36\lib\site-packages\mxnet
   Commit hash file "d:\program files\python36\lib\site-packages\mxnet\COMMIT_HASH" not found. Not installed from pre-built package or built from source.
   Library      : ['d:\\program files\\python36\\lib\\site-packages\\mxnet\\libmxnet.dll']
   Build features:
   No runtime build feature info available
   ----------System Info----------
   Platform     : Windows-10-10.0.17758-SP0
   system       : Windows
   node         : Neutron
   release      : 10
   version      : 10.0.17758
   ----------Hardware Info----------
   machine      : AMD64
   processor    : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
   Name
   Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
   
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0249 sec, LOAD: 2.2043 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5834 sec, LOAD: 1.2048 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0020 sec, LOAD: 0.9903 sec.
   Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:841)>, DNS finished in 0.1266636848449707 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.2234 sec, LOAD: 4.6954 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.3620 sec, LOAD: 0.7789 sec.
   ```
   
   ## Minimum reproducible example
   ## Steps to reproduce
   ```
   from time import time
   
   import mxnet as mx
   
   def data_xform(data):
       """Move channel axis to the beginning, cast to float32, and normalize to [0, 1]."""
       return mx.ndarray.moveaxis(data, 2, 0).astype('float32') / 255
   
   train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
   train_loader = mx.gluon.data.DataLoader(train_data, shuffle=False, batch_size=10000)
   a = iter(train_loader)
   
   # Time how long it takes to fetch the first batch of 10000 samples.
   t = time()
   _ = next(a)
   print(time() - t)
   ```
   A single `next(a)` took `3.6711745262145996` seconds, and all 6 `next(a)` calls took roughly 20 s in total.
   `mx.io.NDArrayIter`, by contrast, finishes iterating almost immediately.
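
To measure every batch rather than only the first, the timing can be wrapped in a small helper. This is a minimal sketch: `time_batches` is a hypothetical name, not part of MXNet, and it only assumes that the object passed in is iterable. A plain `range` stands in for the loader so the sketch runs without MXNet installed.

```python
from time import perf_counter


def time_batches(batches):
    """Return a list with the seconds spent waiting for each item of `batches`.

    Hypothetical helper for illustration; it works on any iterable.
    """
    timings = []
    it = iter(batches)
    while True:
        start = perf_counter()
        try:
            next(it)  # only the wait for the next item is timed
        except StopIteration:
            break
        timings.append(perf_counter() - start)
    return timings


# Stand-in iterable with 6 items, matching the 6 batches reported above.
demo = time_batches(range(6))
print(len(demo), "batches,", sum(demo), "seconds in total")
```

With MXNet installed, passing `train_loader` from the snippet above in place of `range(6)` should give one timing per batch, which makes it easier to compare against another iterator over the same data.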
   
   ## What have you tried to solve it?
   I found that the problem occurs [here](https://github.com/apache/incubator-mxnet/blob/8158ba4b0f1ebd696ec09a0b1aa6031bacb60740/python/mxnet/gluon/data/dataloader.py#L371), but I could not fix it.
   
   What's more, after changing `batch_size` to `10000`, the NDArray API fails to optimize the [MNIST model](https://mxnet.incubator.apache.org/versions/master/tutorials/python/mnist.html); accuracy stays at about 0.11 in every epoch:
   ```
   ......
   >>> batch_size = 10000
   >>> train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
   >>> val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
   ......
   >>> import logging
   >>> logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout
   >>> # create a trainable module on compute context
   ... mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
   >>> mlp_model.fit(train_iter,  # train data
   ...               eval_data=val_iter,  # validation data
   ...               optimizer='sgd',  # use SGD to train
   ...               optimizer_params={'learning_rate':0.1},  # use fixed learning rate
   ...               eval_metric='acc',  # report accuracy during training
   ...               batch_end_callback = mx.callback.Speedometer(batch_size, 100), # output progress for each 100 data batches
   ...               num_epoch=10)  # train for at most 10 dataset passes
   INFO:root:Epoch[0] Train-accuracy=0.107000
   INFO:root:Epoch[0] Time cost=0.123
   INFO:root:Epoch[0] Validation-accuracy=0.113500
   INFO:root:Epoch[1] Train-accuracy=0.112367
   INFO:root:Epoch[1] Time cost=0.183
   INFO:root:Epoch[1] Validation-accuracy=0.113500
   INFO:root:Epoch[2] Train-accuracy=0.112367
   INFO:root:Epoch[2] Time cost=0.158
   INFO:root:Epoch[2] Validation-accuracy=0.113500
   INFO:root:Epoch[3] Train-accuracy=0.112367
   INFO:root:Epoch[3] Time cost=0.504
   INFO:root:Epoch[3] Validation-accuracy=0.113500
   INFO:root:Epoch[4] Train-accuracy=0.112367
   INFO:root:Epoch[4] Time cost=0.142
   INFO:root:Epoch[4] Validation-accuracy=0.113500
   INFO:root:Epoch[5] Train-accuracy=0.112367
   INFO:root:Epoch[5] Time cost=0.164
   INFO:root:Epoch[5] Validation-accuracy=0.113500
   INFO:root:Epoch[6] Train-accuracy=0.112367
   INFO:root:Epoch[6] Time cost=0.471
   INFO:root:Epoch[6] Validation-accuracy=0.113500
   INFO:root:Epoch[7] Train-accuracy=0.112367
   INFO:root:Epoch[7] Time cost=0.167
   INFO:root:Epoch[7] Validation-accuracy=0.113500
   INFO:root:Epoch[8] Train-accuracy=0.112367
   INFO:root:Epoch[8] Time cost=0.241
   INFO:root:Epoch[8] Validation-accuracy=0.113500
   INFO:root:Epoch[9] Train-accuracy=0.112367
   INFO:root:Epoch[9] Time cost=0.307
   INFO:root:Epoch[9] Validation-accuracy=0.113500
   ```
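
One thing that may be worth checking, offered as an assumption rather than a diagnosis: a larger batch size means far fewer parameter updates over the same number of epochs. The arithmetic below is a sketch only. The 60000 figure is the size of the MNIST training set, `num_epoch=10` comes from the call above, and the batch size of 100 is a hypothetical smaller value chosen purely for comparison.

```python
import math

TRAIN_SAMPLES = 60000  # size of the MNIST training set
NUM_EPOCH = 10         # num_epoch passed to fit() above


def updates(batch_size, samples=TRAIN_SAMPLES, epochs=NUM_EPOCH):
    """Return (updates per epoch, updates over all epochs)."""
    per_epoch = math.ceil(samples / batch_size)
    return per_epoch, per_epoch * epochs


# 100 is a hypothetical comparison value; 10000 is the value used in this report.
for bs in (100, 10000):
    per_epoch, total = updates(bs)
    print(f"batch_size={bs}: {per_epoch} updates per epoch, {total} in total")
```

At `batch_size = 10000` the optimizer takes only 6 steps per epoch, 60 over the whole run, so the learning rate and number of epochs suited to a small batch size may not carry over unchanged.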
