Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/01/10 23:15:30 UTC

[GitHub] [incubator-mxnet] zhreshold opened a new issue #17269: [mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes(Part 2)

zhreshold opened a new issue #17269: [mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes(Part 2)
URL: https://github.com/apache/incubator-mxnet/issues/17269
 
 
   ## Description
   This is part 2 of the Gluon Data API extension and fixes. It focuses on speeding up the current data loading pipeline built on the Gluon dataset and dataloader.
   
   ## Motivation
   
   The current data loading pipeline is the major bottleneck for many training tasks. We can summarize the entire flow as:
   
   ```bash
   | Dataset.__getitem__ -> 
   | Transform.__call__()/forward() ->
   | Batchify ->
   | (optional communicate through shared_mem) ->
   | split_and_load(ctxs) ->
   | <training on GPUs>
   -> 
   ```
   This flow raises several performance concerns:
   - the performance of Python dataset and transform functions is unsatisfactory
   - it is not easy to use multithreading to speed up data loading because of the global interpreter lock (GIL)
   - Python multiprocessing is unfortunately slow and error-prone, and the shared memory implementations differ considerably across operating systems (e.g., it is very easy to run out of shared memory if it is not managed carefully)
   - batchify currently has no memory planning, which causes frequent allocation and deallocation of large chunks of memory when the batch size is big
   - batchify followed by split and load can be optimized into a single partial_batchify step
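
The last concern can be illustrated with a minimal pure-Python sketch. The issue only names `partial_batchify`; the function bodies below, the helper `split_evenly`, and the use of nested lists in place of NDArrays are assumptions made for illustration. The idea shown is that splitting the samples per context first avoids building one large intermediate batch that is sliced apart again right away.

```python
def batchify(samples):
    """Stack samples into one batch (a list of lists stands in for an array)."""
    return [list(sample) for sample in samples]


def split_evenly(items, num_parts):
    """Split items into num_parts contiguous slices of near-equal size."""
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def batchify_then_split(samples, num_ctx):
    """Current flow: build the full batch, then slice it per context."""
    full_batch = batchify(samples)  # one large intermediate allocation
    return split_evenly(full_batch, num_ctx)


def partial_batchify(samples, num_ctx):
    """Hypothetical flow: split the samples first, then batchify each slice."""
    return [batchify(part) for part in split_evenly(samples, num_ctx)]


samples = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
per_ctx = partial_batchify(samples, num_ctx=2)
```

Both functions produce the same per-context batches; only the intermediate allocation differs.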
   
   ## Proposal
   To alleviate these problems, I propose a hybrid solution, that is, to
   - provide C++ datasets that cover most use cases
       ```python
       from gluon.data.dataset import TupleDataset, ImageSequenceDataset, ArrayDataset
       # as long as TupleDataset, ImageSequenceDataset, ArrayDataset are supported by backend
       dataset = TupleDataset([ImageSequenceDataset(img_paths), ArrayDataset(image_labels)])
       # dataset is an image classification dataset that is fully supported in C++
       # with TupleDataset we can combine as many datasets as needed
   
       # a C++ backed Dataset can have a magic __handle__ method to return the c++ handle for reference
       class TupleDataset:
           def __init__(self, datasets):
               # getattr with a default avoids AttributeError for pure-Python datasets
               if all(callable(getattr(dataset, '__handle__', None)) for dataset in datasets):
                   # all supported by backend
                   self._tuple_dataset = check_call(_LIB.MXTupleDatasetCreate([dataset.__handle__() for dataset in datasets]))
               else:
                   self._tuple_dataset = None
   
           def __handle__(self):
               return self._tuple_dataset
       ```
   - provide common C++ batchify functions that are split- and context-aware. Batchify with a memory planner is TBD.
   - provide a C++ `MultithreadingDataLoader` which accepts the same arguments as `gluon.data.DataLoader` but uses MXNet internal multithreading rather than Python multiprocessing.
   - fall back to Python multiprocessing whenever
       - the dataset is not fully supported by the backend (e.g., there are custom Python datasets)
       - the transform is not fully hybridizable
       - the batchify function is not fully supported by the backend
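
The memory planner for batchify is left as TBD above, so the following is only one possible reading of it, sketched in pure Python. `BatchBufferPool` and `batchify_into` are hypothetical names, and a flat `array('f')` stands in for a backend buffer. The sketch shows how reusing a released buffer of the same size avoids a fresh allocation for every batch.

```python
from array import array


class BatchBufferPool:
    """Hypothetical pool that reuses flat float buffers, keyed by element count."""

    def __init__(self):
        self._free = {}       # element count -> list of idle buffers
        self.allocations = 0  # number of fresh buffers created so far

    def acquire(self, num_elements):
        idle = self._free.get(num_elements)
        if idle:
            return idle.pop()  # reuse instead of allocating
        self.allocations += 1
        return array('f', [0.0]) * num_elements

    def release(self, buf):
        self._free.setdefault(len(buf), []).append(buf)


def batchify_into(samples, pool):
    """Copy equally sized samples into one pooled flat buffer."""
    sample_len = len(samples[0])
    buf = pool.acquire(len(samples) * sample_len)
    for i, sample in enumerate(samples):
        buf[i * sample_len:(i + 1) * sample_len] = array('f', sample)
    return buf


pool = BatchBufferPool()
first = batchify_into([[1.0, 2.0], [3.0, 4.0]], pool)
pool.release(first)  # the training step is done with this batch
second = batchify_into([[5.0, 6.0], [7.0, 8.0]], pool)  # reuses the same buffer
```

A real planner would also have to deal with variable batch shapes, device memory, and buffers that are still in use by asynchronous operations, none of which this sketch covers.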
   
   Users will continue to use the existing `gluon.data.DataLoader`, and the conversion will be applied automatically:
   ```python
   
   loader = gluon.data.DataLoader(hybrid_dataset.transform(hybrid_transform), batch_size=32, batchify_fn=hybrid_batchify)
   
   class DataLoader:
       def __init__(self, dataset, ...):
           if isinstance(dataset, _LazyTransformDataset) and is_hybrid(dataset._transform) and is_hybrid(dataset) and is_hybrid(batchify_fn):
               self._mt_dataloader = check_call(_LIB.MXMultiThreadDataLoaderCreate(...))
           else:
               # not fully supported by backend, use the fallback path in __iter__
               self._mt_dataloader = None
       def __iter__(self):
           if self._mt_dataloader:
               return self._mt_dataloader
           else:
               # fallback to single thread normal dataloader or multiprocessing dataloader
               ...
   
   ```
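
The selection rule in the pseudocode above can be tried out with a small runnable mock that needs no MXNet install. Everything here is a stand-in: `is_hybrid` is modeled as a plain attribute check, and `select_loader`, `BackendDataset`, `CustomPythonDataset`, `HybridTransform`, and `HybridBatchify` are hypothetical names, not part of any existing API.

```python
def is_hybrid(obj):
    """Assumed check: a component opts in by exposing is_hybrid = True."""
    return bool(getattr(obj, 'is_hybrid', False))


class BackendDataset:
    """Stand-in for a dataset that the C++ backend fully supports."""
    is_hybrid = True

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]


class CustomPythonDataset(BackendDataset):
    """Stand-in for a user-defined dataset written in pure Python."""
    is_hybrid = False


class HybridTransform:
    """Stand-in for a fully hybridizable transform."""
    is_hybrid = True


class HybridBatchify:
    """Stand-in for a batchify function supported by the backend."""
    is_hybrid = True


def select_loader(dataset, transform, batchify_fn):
    """Use the multithreaded path only if every stage is backend-supported."""
    if all(is_hybrid(stage) for stage in (dataset, transform, batchify_fn)):
        return 'multithread'
    return 'multiprocessing'


fast = select_loader(BackendDataset([1, 2, 3]), HybridTransform(), HybridBatchify())
slow = select_loader(CustomPythonDataset([1, 2, 3]), HybridTransform(), HybridBatchify())
```

In this mock a single unsupported stage is enough to send the whole pipeline down the multiprocessing path, which mirrors the three fallback conditions listed in the proposal.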
   
   With this change, MXNet 2.0 will transition smoothly to mixed data loaders. Please comment with specific examples that this proposal fails to accommodate.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] zhreshold commented on issue #17269: [mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes(Part 2)

Posted by GitBox <gi...@apache.org>.

zhreshold commented on issue #17269: [mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes(Part 2)
URL: https://github.com/apache/incubator-mxnet/issues/17269#issuecomment-573242957
 
 
   @szha @eric-haibin-lin @sxjscience @szhengac Request for comments regarding NLP dataloading
