Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/10 14:47:48 UTC

[GitHub] eric-haibin-lin opened a new pull request #10062: [MXNET-72] [WIP] Improve sparse.adam_update

URL: https://github.com/apache/incubator-mxnet/pull/10062
 
 
   ## Description ##
   Fix https://discuss.mxnet.io/t/lazy-update-with-adam-optimizer-is-much-slower-for-sparse-input/724
   
    Many kernels written for `row_sparse` ndarrays are slow on GPU. For example, the sparse SGD kernel launched via `kernel::launch` has the following logic (pseudocode):
   ```cpp
    #pragma omp parallel for
    for (int64_t i = 0; i < num_nonzero_rows; ++i) {
      // thread i computes an entire row of the output
      int64_t row = indices[i];
      for (int64_t col = 0; col < num_columns; ++col) {
        output[row][col] = weight[row][col] - lr * grad[i][col];
      }
    }
   ```
   Such "parallelization by rows"(**A**) works fine on CPU but is problematic on GPU, because:
   - when num_nonzero_rows is small, only a small number of cuda threads are launched. GPU utilization will be low
   -  the memory access pattern is horrible. GPU threads are launched as groups of warps. Each thread in the same warp are accessing different memory location, resulting 32x more unnecessary memory transactions. 
   
    Instead, GPU kernels should be "parallelized by the number of elements" (**B**) to update: each thread handles one element instead of one row, so all threads in the same warp access the same contiguous chunk of memory. A sketch of this mapping is below.
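    
    A minimal sketch of scheme **B**, reusing the names from the pseudocode above (`indices`, `num_nonzero_rows`, `num_columns`, `lr`); the flat index `i` stands in for what `kernel::launch` would hand to each GPU thread, and this is an illustration rather than the exact kernel in this PR:
    ```cpp
    // Scheme B: one logical thread per element.
    for (int64_t i = 0; i < num_nonzero_rows * num_columns; ++i) {  // each i = one GPU thread
      int64_t nz_row = i / num_columns;   // which non-zero row this element belongs to
      int64_t col    = i % num_columns;   // column within that row
      int64_t row    = indices[nz_row];   // destination row in the dense weight
      // consecutive i -> consecutive col -> coalesced accesses within a warp
      output[row][col] = weight[row][col] - lr * grad[nz_row][col];
    }
    ```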
   
    On the other hand, when I apply **B** on CPU with OpenMP, performance is 3-4x slower, so this PR only applies **B** on GPUs. (I didn't dig into why this happens; the performance should be comparable if static OpenMP scheduling is used, but I didn't check what the default OpenMP scheduling strategy is on my instance. Maybe @cjolivier01 has more insight on CPU performance?) A sketch of the static-scheduling idea follows.
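    
    For reference, a minimal sketch of that static-scheduling guess, applied to the same element-wise loop as above; the pragma is the only change, and whether this actually recovers CPU performance is unverified:
    ```cpp
    // schedule(static) hands each OpenMP thread one contiguous chunk of i up front,
    // avoiding the per-chunk dispatch overhead of dynamic/guided scheduling.
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < num_nonzero_rows * num_columns; ++i) {
      int64_t nz_row = i / num_columns;
      int64_t col = i % num_columns;
      int64_t row = indices[nz_row];
      output[row][col] = weight[row][col] - lr * grad[nz_row][col];
    }
    ```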
   
    Time (s) for 300 iterations with `lazy_update=True`, i.e. only the rows present in `grad.indices` are updated (~26x improvement):
    
    | nnr (non-zero rows) | 1280 | 12800 |
    |---|---|---|
    | Before | 1.31863999367 | 13.2637498379 |
    | After | 0.050815820694 | 0.430017948151 |
   
    Time (s) for 30 iterations with `lazy_update=False`, i.e. all rows are updated (~34x improvement):
    
    | nnr (non-zero rows) | 1280 | 12800 |
    |---|---|---|
    | Before | 22.2087020874 | 21.9239070415 |
    | After | 0.658735990524 | 0.678998947144 |
   
    (This PR currently targets a non-master branch just to trigger the CI; the destination branch will be updated to master later.)
   
   ## Checklist ##
   ### Essentials ###
   - [ ] Passed code style checking (`make lint`)
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments are documented. 
    - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
    - [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [ ] Feature1, tests, (and when applicable, API doc)
   - [ ] Feature2, tests, (and when applicable, API doc)
   
   ## Comments ##
   - If this change is a backward incompatible change, why must this change be made.
   - Interesting edge cases to note here
   
    ```python
   import time
   import mxnet as mx
   import numpy as np
   import argparse
   
   mx.random.seed(0)
   np.random.seed(0)
   
   parser = argparse.ArgumentParser(description='Bench updater')
   parser.add_argument('--dim-in', type=int, default=240000, help='weight.shape[0]')
   parser.add_argument('--dim-out', type=int, default=512, help='weight.shape[1]')
   parser.add_argument('--nnr', type=int, default=5000, help='grad.indices.shape[0]')
   parser.add_argument('--repeat', type=int, default=1000, help='num repeat')
   parser.add_argument('--dense-grad', action='store_true')
   parser.add_argument('--dense-state', action='store_true')
   parser.add_argument('--cpu', action='store_true')
   args = parser.parse_args()
   dim_in = args.dim_in
   dim_out = args.dim_out
   nnr = args.nnr
   ctx = mx.cpu() if args.cpu else mx.gpu()
   
    ones = mx.nd.ones((dim_in, dim_out), ctx=ctx)
    
    if not args.dense_grad:
        # row_sparse weight and a row_sparse gradient with nnr distinct
        # non-zero rows (np.unique also sorts the sampled indices)
        weight = ones.tostype('row_sparse')
        indices = np.arange(dim_in)
        np.random.shuffle(indices)
        indices = np.unique(indices[:nnr])
        indices = mx.nd.array(indices, ctx=ctx)
        grad = mx.nd.sparse.retain(weight, indices)
    else:
        weight = ones.copy()
        grad = ones.copy()
    
    # optimizer state (mean/var) can be dense or row_sparse
    if args.dense_state:
        mean = ones.copy()
    else:
        mean = ones.tostype('row_sparse')
    
    var = mean.copy()
   
    # warmup
    for i in range(10):
        mx.nd.sparse.adam_update(weight, grad, mean, var, out=weight, lr=1, wd=0,
                                 beta1=0.9, beta2=0.99, rescale_grad=0.5, epsilon=1e-8)
    weight.wait_to_read()
    
    # time `repeat` updates; wait_to_read() blocks until the async engine finishes
    a = time.time()
    for i in range(args.repeat):
        mx.nd.sparse.adam_update(weight, grad, mean, var, out=weight, lr=1, wd=0,
                                 beta1=0.9, beta2=0.99, rescale_grad=0.5, epsilon=1e-8)
    weight.wait_to_read()
    b = time.time()
    print(b - a)
   ```
