Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/08 17:32:42 UTC

[GitHub] [incubator-mxnet] rondogency opened a new issue #18672: Gluon 2.0 Dataloader should support BERT training using GluonNLP

rondogency opened a new issue #18672:
URL: https://github.com/apache/incubator-mxnet/issues/18672


   ## Description
   Currently the 2.0 DataLoader cannot be used to train BERT because it is not flexible enough to support the data schema used by GluonNLP BERT. Specifically, when a nested list of variable-length NumPy arrays is passed in, construction of the dataset fails and throws NDArray conversion errors.
   
   Here is a minimal reproducible example, which uses a data schema similar to the one in the BERT pre-training script:
   
   import mxnet as mx
   import numpy as np
   a = np.ndarray(shape=(128,)) # similar to one feature of one sequence
   b = np.ndarray(shape=(19,))
   l1 = [a,b]                                   # similar to one feature of all sequences
   l2 = [a,b]
   c = [l1, l2]                                 # similar to a training instance that will be sampled against
   ds = mx.gluon.data.ArrayDataset(*c)
   dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
   print('ok') # errors out before this line is reached
   
   ## References
   https://github.com/apache/incubator-mxnet/pull/17841
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-mxnet] zhreshold commented on issue #18672: Gluon 2.0 Dataloader should support BERT training using GluonNLP

Posted by GitBox <gi...@apache.org>.

zhreshold commented on issue #18672:
URL: https://github.com/apache/incubator-mxnet/issues/18672#issuecomment-655701277


   In 2.0, if `try_nopython` is set to False, the behavior is the same as in 1.0.
   If `try_nopython` is True, the dataset has to be converted to ndarray, and the nested arrays with different types and shapes are causing the problem. If anyone can help figure out the correct layout for converting the complex BERT-style dataset, I can help look into the fix.
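
One possible layout for the conversion mentioned above, sketched with plain NumPy only: pad every sequence to the longest length and keep the true lengths in a second array, so that each feature becomes a single rectangular array. The helper `pad_ragged` and its `pad_value` parameter are hypothetical names introduced for illustration. They are not part of MXNet or GluonNLP, and this layout has not been checked against the 2.0 DataLoader.

```python
import numpy as np


def pad_ragged(arrays, pad_value=0):
    """Pack variable-length 1-D arrays into a padded 2-D array plus lengths.

    Hypothetical helper for illustration; not an MXNet or GluonNLP API.
    """
    lengths = np.array([len(x) for x in arrays], dtype=np.int32)
    # One row per sequence, as wide as the longest sequence.
    padded = np.full((len(arrays), int(lengths.max())), pad_value,
                     dtype=arrays[0].dtype)
    for row, x in enumerate(arrays):
        padded[row, :len(x)] = x  # copy real values, leave the rest padded
    return padded, lengths


a = np.ones(128, dtype=np.float32)  # one feature of one sequence
b = np.ones(19, dtype=np.float32)   # same feature, shorter sequence
padded, valid_length = pad_ragged([a, b])
print(padded.shape, valid_length.tolist())  # (2, 128) [128, 19]
```

The trade-off is extra memory for the padding, and any downstream batching code would need to use the length array to ignore the padded positions.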



[GitHub] [incubator-mxnet] rondogency commented on issue #18672: Gluon 2.0 Dataloader should support BERT training using GluonNLP

Posted by GitBox <gi...@apache.org>.

rondogency commented on issue #18672:
URL: https://github.com/apache/incubator-mxnet/issues/18672#issuecomment-655656985


   @eric-haibin-lin @sxjscience @zhreshold FYI



[GitHub] [incubator-mxnet] sxjscience commented on issue #18672: Gluon 2.0 Dataloader should support BERT training using GluonNLP

Posted by GitBox <gi...@apache.org>.

sxjscience commented on issue #18672:
URL: https://github.com/apache/incubator-mxnet/issues/18672#issuecomment-655658852


   I can reproduce this failure message:
   
   ```python
   import mxnet as mx
   import numpy as np
   mx.npx.set_np()
   a = np.ndarray(shape=(128,)) # similar to one feature of one sequence
   b = np.ndarray(shape=(19,))
   l1 = [a,b] # similar to one feature of all sequences
   l2 = [a,b]
   c = [l1, l2] # similar to a training instance that will be sampled against
   ds = mx.gluon.data.ArrayDataset(*c)
   dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
   print('ok') # errors out before this line is reached
   ```
   
   Error message:
   
   ```python
   ~/miniconda3/lib/python3.7/site-packages/mxnet/gluon/data/dataset.py in __mx_handle__(self)
       383                     datasets.append(data.__mx_handle__())
       384                 else:
   --> 385                     datasets.append(NDArrayDataset(arr=default_array(data)))
       386             self.handle = GroupDataset(datasets=datasets)
       387         return self.handle
   
   ~/miniconda3/lib/python3.7/site-packages/mxnet/util.py in default_array(source_array, ctx, dtype)
       936     from . import np as _mx_np
       937     if is_np_array():
   --> 938         return _mx_np.array(source_array, ctx=ctx, dtype=dtype)
       939     else:
       940         return _mx_nd.array(source_array, ctx=ctx, dtype=dtype)
   
   ~/miniconda3/lib/python3.7/site-packages/mxnet/numpy/multiarray.py in array(object, dtype, ctx)
      2407             # printing out the error raised by official NumPy's array function
      2408             # for transparency on users' side
   -> 2409             raise TypeError('{}'.format(str(e)))
      2410     ret = empty(object.shape, dtype=dtype, ctx=ctx)
      2411     if len(object.shape) == 0:
   
   TypeError: setting an array element with a sequence.
   ```
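
The traceback ends in an error that, according to the comment in `multiarray.py`, originates from official NumPy's array function. The following sketch reproduces the likely root cause with plain NumPy only, without MXNet: arrays of different lengths cannot be packed into one rectangular numeric array. The helper `try_pack` is a hypothetical name used for illustration, and the exact exception type and message may differ between NumPy versions.

```python
import numpy as np

a = np.zeros(128, dtype=np.float32)  # one feature of one sequence
b = np.zeros(19, dtype=np.float32)   # same feature, different length


def try_pack(arrays):
    """Return the packed array, or the exception NumPy raised.

    Hypothetical helper for illustration only.
    """
    try:
        return np.array(arrays, dtype=np.float32)
    except (ValueError, TypeError) as exc:
        return exc


ragged = try_pack([a, b])   # lengths 128 and 19: cannot form a 2-D array
uniform = try_pack([a, a])  # equal lengths: packs into shape (2, 128)

print(type(ragged).__name__, "|", ragged)
print(uniform.shape)
```

The equal-length case succeeds, which suggests the failure comes from the ragged shapes rather than from the nesting itself.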

