Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/16 19:55:38 UTC

[GitHub] Ishitori opened a new issue #13909: gluon.utils.split_and_load(even_split=False) fails if num of contexts > num of data

URL: https://github.com/apache/incubator-mxnet/issues/13909
 
 
   ## Description
   Sometimes it is hard to predict how much data is left in the DataLoader by the time the last batch comes. When doing multi-GPU training with `last_batch='keep'`, the number of items in the last batch can be smaller than the number of GPUs. In that case `gluon.utils.split_and_load` throws an exception: `ValueError: Too many slices for data with shape ...`.
   
   It would be great if this worked transparently. I would expect that if `even_split=False` is passed to `split_and_load`, no exception should be raised: the data should be distributed so that some contexts receive empty (or no) arrays, and the forward and backward passes for those contexts are silently skipped.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.4
   Compiler     : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
   Build        : ('default', 'Jan 16 2018 12:04:33')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 18.0
   Directory    : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   /Users/sssokolo/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   Version      : 1.5.0
   Directory    : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/mxnet
   Commit Hash   : fd34dc5f847192dfd522555afdf13be1eb67b72b
   ----------System Info----------
   Platform     : Darwin-16.7.0-x86_64-i386-64bit
   system       : Darwin
   node         : 8c859074eea0
   release      : 16.7.0
   version      : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
   b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0299 sec, LOAD: 0.6207 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.1785 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0008 sec, LOAD: 0.1612 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0007 sec, LOAD: 0.1032 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0007 sec, LOAD: 0.4562 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0006 sec, LOAD: 0.0634 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   Python
   
   ## Error Message:
   ```
   Traceback (most recent call last):
     File "/Volumes/Unix/workspace/exception_small_batch_to_split/main.py", line 25, in <module>
       data = utils.split_and_load(data, context, even_split=False)
     File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 116, in split_and_load
       slices = split_data(data, len(ctx_list), batch_axis, even_split)
     File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 69, in split_data
       "num_slice=%d and batch_axis=%d."%(str(data.shape), num_slice, batch_axis))
   ValueError: Too many slices for data with shape (1, 5). Arguments are num_slice=2 and batch_axis=0.
   ```
   ## Minimum reproducible example
   A regular, minimal multi-context training loop is enough to reproduce it:
   
   ```python
   import mxnet as mx
   from mxnet import nd, gluon, autograd
   from mxnet.gluon import utils, Trainer
   from mxnet.gluon.data import ArrayDataset, DataLoader
   from mxnet.gluon.loss import SoftmaxCrossEntropyLoss
   
   context = [mx.cpu(0), mx.cpu(1)]
   datasize = 3
   batch_size_per_context = 1
   
   data = nd.random.uniform(-1, 1, shape=(datasize, 5))
   label = nd.random.uniform(-1, 1, shape=(datasize, 1))
   
   dataset = ArrayDataset(data, label)
   dataloader = DataLoader(dataset,
                           batch_size=len(context) * batch_size_per_context,
                           last_batch='keep')
   
   net = gluon.nn.Dense(units=2)
   net.initialize(ctx=context)
   
   loss_fn = SoftmaxCrossEntropyLoss()
   trainer = Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
   
   for (data, label) in dataloader:
       data = utils.split_and_load(data, context, even_split=False)
       label = utils.split_and_load(label, context, even_split=False)
   
       losses = []
   
       for d, l in zip(data, label):
           with autograd.record():
               out = net(d)
               losses.append(loss_fn(out, l))
   
       for loss in losses:
           loss.backward()
   
       trainer.step(1)
   ```
