Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/19 08:36:48 UTC

[GitHub] Rikorose opened a new issue #9493: Data Loading API for NDArray

Rikorose opened a new issue #9493: Data Loading API for NDArray
URL: https://github.com/apache/incubator-mxnet/issues/9493
 
 
   ## Description
   I am struggling to find an efficient way of loading non-image data of shape `(f, t)`, where `f` is the number of features and `t` corresponds to time and therefore varies in length. I need to preprocess my data and currently do this with numpy. The preprocessing is costly, so I want to do it only once and save the preprocessed data in a RecordIO file (`MXRecordIO`).
   For images there are several examples ([preprocess to record](https://github.com/apache/incubator-mxnet/commits/master/tools/im2rec.py), [iterator for gluon](https://github.com/apache/incubator-mxnet/blob/11cb609002fc202f74fcaadcde76e1efefa17f05/example/gluon/data.py)). But what is the recommended 'best practice' for storing `mx.nd.array` data and loading it (shuffled, batched, repeated over epochs, prefetched/buffered using several threads)?
   
   ## What have you tried to solve it?
   I currently store the data using numpy and write it to the record.
   ```py
   import io
   import os

   import mxnet as mx
   import numpy as np

   # `args` is assumed to come from the script's argument parser (not shown)
   path_idx = os.path.join(args.record_dir, 'data.idx')
   path_rec = os.path.join(args.record_dir, 'data.rec')
   record = mx.recordio.MXIndexedRecordIO(path_idx, path_rec, 'w')
   for i in range(10):
     features = np.zeros(shape=(222, 3179), dtype=np.float32)
     label = np.zeros(shape=(3179,))
     # This is really not nice, how to improve this, maybe using mx.nd.array?
     # Serialize the feature matrix to bytes in .npy format
     buffer = io.BytesIO()
     np.save(buffer, features)
     header = mx.recordio.IRHeader(0, label, i, 0)
     s = mx.recordio.pack(header, buffer.getvalue())
     # Write the packed record under key i
     record.write_idx(i, s)
   record.close()
   ```
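
For reference, here is a minimal sketch of how a payload serialized with `np.save` into a `BytesIO` buffer, as in the snippet above, could be decoded back into an array. The helper names `encode_array` and `decode_array` are hypothetical and only illustrate the round trip. The sketch uses numpy alone and does not touch the record file itself.

```python
import io

import numpy as np


def encode_array(arr):
    """Serialize an array to bytes in .npy format (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def decode_array(payload):
    """Rebuild the array from bytes produced by encode_array (hypothetical helper)."""
    return np.load(io.BytesIO(payload))


features = np.zeros(shape=(222, 3179), dtype=np.float32)
payload = encode_array(features)
restored = decode_array(payload)
```

The `.npy` format stores shape and dtype alongside the data, so the variable-length `t` dimension survives the round trip without extra bookkeeping.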
   
   How can I access and shuffle the data without loading the complete data set into memory?
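
One possible direction, sketched under assumptions rather than offered as an established best practice: shuffle only the record keys in memory and fetch each record lazily by key. The function `iter_shuffled_batches`, its parameters, and the in-memory `fake_store` are hypothetical stand-ins. With an `MXIndexedRecordIO` opened in read mode, the `keys` attribute and `read_idx` followed by `mx.recordio.unpack` might fill the same roles, but that combination is not exercised here.

```python
import random


def iter_shuffled_batches(keys, read_record, batch_size, seed=None):
    """Yield lists of records in shuffled order.

    Only the keys are held and shuffled in memory; each record is
    fetched on demand through read_record(key).
    """
    order = list(keys)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [read_record(k) for k in order[start:start + batch_size]]


# Stand-in for an indexed record file: key -> stored payload
fake_store = {i: "record-%d" % i for i in range(10)}

batches = list(
    iter_shuffled_batches(fake_store.keys(), fake_store.__getitem__,
                          batch_size=4, seed=0)
)
```

Batches are returned as plain lists rather than stacked arrays, because records with different `t` would need padding or bucketing before they can be combined. Calling the generator again with a different seed gives a new order for the next epoch.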
   
   In fact, I was wondering why there is only an `ImageRecordIter` but no general iterator for `NDArray` or arbitrary data. Maybe I am just spoiled from using `tf.data.Dataset`.
   
