Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/05 06:52:23 UTC

[GitHub] asitstands opened a new pull request #9991: Random shuffle implementation

asitstands opened a new pull request #9991: Random shuffle implementation
URL: https://github.com/apache/incubator-mxnet/pull/9991
 
 
   ## Description ##
   
   This operator randomly shuffles the elements of an NDArray along the last axis, i.e., it performs a batched shuffle. For example, when a matrix (2-D array) is shuffled, the elements in each row are shuffled within that row, and the shuffles of different rows are statistically independent. Thus, for each element, all indices except the last are preserved, while the last index changes randomly.
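   The intended semantics can be sketched with NumPy (for illustration only; the operator name and API in this PR are not shown above): each row of a 2-D array is permuted independently, and every index except the last is preserved.

   ```python
   import numpy as np

   rng = np.random.RandomState(0)
   x = np.arange(12).reshape(3, 4)

   # Shuffle along the last axis: each row is permuted in place,
   # independently of the other rows.
   shuffled = x.copy()
   for row in shuffled:
       rng.shuffle(row)  # in-place permutation of one row

   # Each row still contains exactly its original elements.
   for orig, perm in zip(x, shuffled):
       assert sorted(perm) == sorted(orig)
   ```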
   
   ### Motivation
   
   1. Random shuffle is a widely used operation: most random number APIs, such as those of NumPy or the C++ standard library, provide it, while MXNet lacks it. It would be especially useful for shuffling arrays on GPU.
   2. If this PR is accepted, I'd like to open a separate PR that removes the dependency of the MXNet Python API on the *global* random number generators of Python and NumPy. That dependency is not sound because it introduces an unnecessary coupling between MXNet and other parts of a user's environment, which can be a source of subtle technical or mathematical bugs. I personally spent a fair amount of time tracking down an unexpected source of randomness caused by this coupling, and the dependency has confused many newcomers; see, for example, #9978, #9335, #7124 and #736. To replace random number generation through the global generators of Python and NumPy with generation through the `mx.nd.random` operators, we need to copy data back and forth between a NumPy array (or a Python list) and an NDArray, but I think the overhead is not significant, and avoiding the coupling is much more important. When I actually attempted the replacement, the only obstacle was that MXNet does not support random shuffling, so I implemented it. Since the shuffle is useful regardless of this motivation, I opened it as a separate PR.
   
   ### Implementation
   
   1. On CPU, the shuffling is delegated to `__gnu_parallel::random_shuffle` for GCC and to `std::shuffle` for other compilers. The shuffles within a batch are processed sequentially, one by one.
   2. On GPU, a random key of `double` type is generated for each element of the array, and the elements are then sorted by the keys. The keys for the `n`-th batch are distributed in the range `[n, n + 1]`, so the entire array can be sorted by the keys in a single pass. The sorting is delegated to MXNet's `SortByKey`, which in turn delegates to thrust's `sort_by_key`. If both the batch size and the length of the last axis are very large, the quality of the shuffling could degrade, i.e., the outcomes may not be uniformly distributed: as `n` grows, the number of representable floating point numbers in `[n, n + 1]` shrinks, so the probability of duplicate keys grows. However, I believe this hardly causes a problem in practice. If it is not acceptable, an alternative implementation is to process the shuffles within a batch sequentially, but that alternative performs very poorly compared to the current implementation even when the batch size is only a few hundred.
   3. In-place shuffling is allowed.
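   The GPU strategy in item 2 can be sketched in NumPy (an illustration of the algorithm, not the actual CUDA code): one key in `[n, n + 1)` is drawn for every element of batch `n`, and a single global sort of all keys then shuffles every batch at once, because keys from different batches never interleave.

   ```python
   import numpy as np

   rng = np.random.RandomState(42)
   batch, length = 4, 5
   x = np.arange(batch * length).reshape(batch, length)

   # One uniform key in [n, n + 1) for every element of batch n.
   keys = rng.uniform(size=(batch, length)) + np.arange(batch)[:, None]

   # A single argsort over the flattened keys shuffles each batch
   # independently: batch n's keys all precede batch (n + 1)'s keys,
   # so batches stay in place while their elements are permuted.
   order = np.argsort(keys.ravel())
   shuffled = x.ravel()[order].reshape(batch, length)

   # Every row is still a permutation of its original row.
   for orig, perm in zip(x, shuffled):
       assert sorted(perm) == sorted(orig)
   ```

   This also makes the quality caveat concrete: for large `n`, fewer doubles are representable in `[n, n + 1)`, so duplicate keys (and thus biased orderings) become more likely.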
   
   ## Checklist ##
   ### Essentials ###
   - [x] Passed code style checking (`make lint`)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
   - For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
   - [x] To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services