Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/02/20 00:22:03 UTC

[GitHub] stephenrawls opened a new issue #14210: Slow Async GPU Copy

URL: https://github.com/apache/incubator-mxnet/issues/14210
 
 
   Some of our NLP models need multiple input arrays, sometimes up to 6 separate ndarrays. This is a big problem for multi-GPU training because the CPU -> GPU copy time is very slow. In particular, the *actual* cuda memcpy is fast; it is the Python overhead of calling into the MXNet C API that seems slow.
   
   Here is a sample script that shows what I mean:
   ```
   import time
   import mxnet as mx

   def generate_data(num_inputs, ngpus):
       return [[mx.nd.random.randn(1) for _ in range(num_inputs)] for _ in range(ngpus)]

   def send_to_gpu(data, ctx_list):
       # Issue one async copy per (gpu, input) pair
       return [[data[i][j].as_in_context(ctx_list[i]) for j in range(len(data[0]))] for i in range(len(data))]

   ctx_list = [mx.gpu(i) for i in range(8)]

   # Warmup: the first copy to each GPU initializes the context and is much slower
   data = generate_data(num_inputs=1, ngpus=8)
   send_to_gpu(data, ctx_list)
   mx.nd.waitall()

   for num_inputs in range(1, 11):
       data = generate_data(num_inputs, ngpus=8)

       start = time.time()
       data = send_to_gpu(data, ctx_list)
       end = time.time()

       print("Num Inputs: %d. Took %f ms to set off all async copies" % (num_inputs, 1000*(end-start)))
   ```
   
   The output on a p3.16xlarge instance is:
   ```
   % python3 ~/test_send_to_gpu_gh.py
   Num Inputs: 1. Took 0.680685 ms to set off all async copies
   Num Inputs: 2. Took 2.131939 ms to set off all async copies
   Num Inputs: 3. Took 2.979040 ms to set off all async copies
   Num Inputs: 4. Took 4.072189 ms to set off all async copies
   Num Inputs: 5. Took 4.901409 ms to set off all async copies
   Num Inputs: 6. Took 7.693768 ms to set off all async copies
   Num Inputs: 7. Took 6.579638 ms to set off all async copies
   Num Inputs: 8. Took 7.871866 ms to set off all async copies
   Num Inputs: 9. Took 8.602858 ms to set off all async copies
   Num Inputs: 10. Took 9.772539 ms to set off all async copies
   ```
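   A quick back-of-envelope on the timings above (stdlib only; the numbers are pasted straight from the output): a least-squares fit shows the dispatch time growing roughly linearly, at about 1 ms per additional input across the 8 GPUs, i.e. roughly 0.12 ms of fixed overhead per `as_in_context` call:

```python
# Least-squares fit of dispatch time vs. number of inputs,
# using the timings printed by the script above (8 GPUs).
num_inputs = list(range(1, 11))
ms = [0.680685, 2.131939, 2.979040, 4.072189, 4.901409,
      7.693768, 6.579638, 7.871866, 8.602858, 9.772539]

n = len(num_inputs)
mean_x = sum(num_inputs) / n
mean_y = sum(ms) / n
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(num_inputs, ms)) / sum((x - mean_x) ** 2 for x in num_inputs)

per_call_ms = slope / 8  # each extra input adds one copy call per GPU
print("~%.2f ms per extra input, ~%.3f ms per as_in_context call" % (slope, per_call_ms))
# -> ~0.98 ms per extra input, ~0.123 ms per as_in_context call
```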
   
   My thoughts are: 
   
   (1) Can I move this overhead out of the main training thread and hide the latency, the same way we hide data-loading latency? I think to do that I would need something like the CUDA IPC support that PyTorch has: https://github.com/pytorch/pytorch/blob/220ce8046e5fcf1434f948795bcdefda33e95e9a/torch/multiprocessing/reductions.py
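   One cheaper variant of idea (1), sketched below under the assumption that the per-call overhead can overlap across threads (e.g. if MXNet releases the GIL while inside the C API): issue each GPU's copies from a thread pool instead of serially. `issue_copy` here is a hypothetical stand-in, not MXNet API; the sleep simulates the per-call dispatch cost (exaggerated to 5 ms so the effect is visible), where the real call would be `data[g][i].as_in_context(ctx_list[g])`:

```python
import time
from concurrent.futures import ThreadPoolExecutor

NGPUS = 8
NUM_INPUTS = 6
PER_CALL_OVERHEAD_S = 0.005  # exaggerated stand-in for the ~0.1 ms per-call cost

def issue_copy(gpu_id, input_id):
    # Hypothetical stand-in for data[gpu_id][input_id].as_in_context(...);
    # the sleep simulates the fixed Python/C-API dispatch overhead.
    time.sleep(PER_CALL_OVERHEAD_S)
    return (gpu_id, input_id)

def serial_dispatch():
    # 8 GPUs * 6 inputs = 48 calls back to back, as in the script above
    return [issue_copy(g, i) for g in range(NGPUS) for i in range(NUM_INPUTS)]

def threaded_dispatch(pool):
    # One task per GPU: each worker issues that GPU's copies in order,
    # so the per-call overhead of different GPUs overlaps.
    futures = [pool.submit(lambda g=g: [issue_copy(g, i) for i in range(NUM_INPUTS)])
               for g in range(NGPUS)]
    return [f.result() for f in futures]

start = time.time(); serial_dispatch(); serial_s = time.time() - start
with ThreadPoolExecutor(max_workers=NGPUS) as pool:
    start = time.time(); threaded_dispatch(pool); threaded_s = time.time() - start

print("serial: %.1f ms, threaded: %.1f ms" % (1000 * serial_s, 1000 * threaded_s))
```

   Whether this helps with real `as_in_context` calls depends on how much of the overhead actually happens outside the GIL, which is exactly the open question.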
   
   (2) Can I reduce the overhead of each `.as_in_context()` call? I am *mostly* sure the call is asynchronous and is just suffering from high per-call overhead. My thought there was to try the Cython bindings, which at least some places on the internet suggest have lower overhead when calling into the C API, but it looks like they are currently broken pending this patch: https://github.com/apache/incubator-mxnet/pull/10951
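   A third option, independent of Cython, is to reduce the number of calls rather than the cost of each one: pack the input arrays into a single contiguous buffer on the CPU, copy once per GPU, and slice on the device. Below is just the pack/unpack bookkeeping in plain Python (in MXNet the pack could be an `mx.nd.concat` over flattened arrays and the unpack a set of slices); `pack` and `unpack` are hypothetical helpers, not MXNet API:

```python
# Pack several 1-D arrays into one flat buffer plus (offset, length) pairs,
# so only one copy call per GPU is needed; unpack recovers the originals.
def pack(arrays):
    buf, offsets = [], []
    for a in arrays:
        offsets.append((len(buf), len(a)))
        buf.extend(a)
    return buf, offsets

def unpack(buf, offsets):
    return [buf[start:start + length] for start, length in offsets]

inputs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # stand-ins for the input ndarrays
buf, offsets = pack(inputs)  # one buffer -> one as_in_context call per GPU
assert unpack(buf, offsets) == inputs
print(buf)      # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(offsets)  # [(0, 2), (2, 1), (3, 3)]
```

   This trades one extra CPU-side concatenation per batch for a 6x reduction in API calls per GPU.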
   
   Can someone look at this and let me know if I'm doing anything obviously wrong in the way I am trying to asynchronously copy data to each GPU? And is there an easy way to get faster CPU -> GPU copy times? (Again, the actual data is relatively small and the copy itself is quick; the problem is the overhead of issuing 8 (GPUs) * 6 (input arrays) = 48 copy calls.)
   
   Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services