Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/29 16:26:18 UTC

[GitHub] [incubator-mxnet] access2rohit commented on a change in pull request #18168: Separate GPU kernel for broadcast_axis

access2rohit commented on a change in pull request #18168:
URL: https://github.com/apache/incubator-mxnet/pull/18168#discussion_r417447128



##########
File path: src/operator/tensor/broadcast_reduce_op.h
##########
@@ -1077,6 +1077,42 @@ struct broadcast_kernel {
   }
 };
 
+namespace {
+struct shape_and_stride {
+  index_t in_stride[MXNET_SPECIAL_MAX_NDIM];
+  index_t out_stride[MXNET_SPECIAL_MAX_NDIM];
+  index_t input_shape[MXNET_SPECIAL_MAX_NDIM];
+  index_t output_shape[MXNET_SPECIAL_MAX_NDIM];
+};

Review comment:
    @sxjscience 
    The maximum size these arrays can take is 5 (MXNET_SPECIAL_MAX_NDIM). Even if the arrays are filled to a capacity of only 1, there is no meaningful effect on performance.
   
    Perf benchmarking code:
    ```
    import mxnet as mx
    import mxnet.ndarray as nd
    from benchmark.opperf.utils.benchmark_utils import run_performance_test
    
    local_ctx = mx.gpu()
    op = nd.broadcast_axis
    
    # Broadcast a (128, 1) input along axis 1 up to size 128.
    res = run_performance_test(op, run_backward=True, dtype='float32', ctx=local_ctx,
                               inputs=[{'data': (128, 1), 'axis': 1, 'size': 128}],
                               warmup=50, runs=500, profiler='python')
    print(res)
    ```
   
   Results:
   
   My PR:
   ```
   INFO:root:Begin Benchmark - broadcast_axis
   INFO:root:Complete Benchmark - broadcast_axis
   [{'broadcast_axis': [{'inputs': {'data': (128, 1), 'axis': 1, 'size': 128}, 'avg_time_broadcast_axis': 0.23971764800000983, 'p50_time_broadcast_axis': 0.237886499988349, 'p90_time_broadcast_axis': 0.24467039999080953, 'p99_time_broadcast_axis': 0.26128606997787074}]}]
   ```
    Master:
   ```
   INFO:root:Begin Benchmark - broadcast_axis
   INFO:root:Complete Benchmark - broadcast_axis
   [{'broadcast_axis': [{'inputs': {'data': (128, 1), 'axis': 1, 'size': 32}, 'avg_time_broadcast_axis': 0.2699070239946195, 'p50_time_broadcast_axis': 0.2669120001428382, 'p90_time_broadcast_axis': 0.27620860007573356, 'p99_time_broadcast_axis': 0.35185581999940035}]}]
   ```
   
    It's still faster than master.
   
    > > > Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing while the CPU kernel uses int64_t (when the large tensor compile flag is on); duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template but with a different index_t type? Also, it would be nice to re-architect the code so that it applies to all operators that share the same template between CPU and GPU kernels.
   > > 
   > > 
   > > @apeforest
   > > 
    > > * We cannot use a typedef based on xpu because a typedef is a compile-time construct.
    > > * The blast radius is large if we make a generic CPU/GPU kernel. NumPy ops that call the broadcast_axis kernel for their internal operations would need to be updated as well, so I created broadcast_axis_gpu for ops relying only on BroadcastComputeImpl internally, which is evident from the 43 GPU tests that failed when temp workspace was used.
   > > 
    > > Hence, two structs are not overkill but necessary to limit the blast radius; otherwise the change would require performing stride calculations inside the implementations of approximately 15 operators, significantly increasing the amount of code that has to change.
    > > Again, a typedef is resolved at compile time and cannot be decided dynamically.
    > > I have updated the description. It wasn't consistent while the PR was WIP; it is now ready for review!
   > 
    > I did not mean `typedef` is the only way to do so. Can we leverage the template feature? In fact, the Map function in the GPU kernel always uses `int` instead of `index_t`.
   
    @apeforest : A template could be used to fuse the two kernels together, but that approach would not improve CPU performance as much as https://github.com/apache/incubator-mxnet/pull/17882, since there I am actually using vectorization.
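    
    For readers following along, here is a minimal standalone sketch of the template-based fusion being discussed: one kernel body parameterized on the index type, so a GPU path could instantiate it with int32_t and a CPU path with int64_t. Everything here (broadcast_map, kMaxNDim, the whole program) is an illustrative stand-in, not the actual code in this PR.
    
    ```
    #include <cstdint>
    #include <cstdio>
    
    constexpr int kMaxNDim = 5;  // stand-in for MXNET_SPECIAL_MAX_NDIM
    
    struct shape_and_stride {
      int64_t in_stride[kMaxNDim];
      int64_t out_stride[kMaxNDim];
      int64_t input_shape[kMaxNDim];
      int64_t output_shape[kMaxNDim];
    };
    
    // Hypothetical shared kernel body: IType would be int32_t on GPU and
    // int64_t on CPU when the large tensor flag is on.
    template <typename IType, typename DType>
    void broadcast_map(IType i, const DType* in, DType* out,
                       const shape_and_stride& aux, int ndim) {
      IType idx = i, in_idx = 0;
      for (int d = 0; d < ndim; ++d) {
        // Recover the output coordinate along dimension d ...
        IType coord = idx / static_cast<IType>(aux.out_stride[d]);
        idx %= static_cast<IType>(aux.out_stride[d]);
        // ... and map it to the input; broadcast (size-1) dims add no offset.
        if (aux.input_shape[d] != 1)
          in_idx += coord * static_cast<IType>(aux.in_stride[d]);
      }
      out[i] = in[in_idx];
    }
    
    int main() {
      // Broadcast a (2, 1) input to (2, 3), i.e. broadcast_axis with
      // axis=1, size=3.
      shape_and_stride aux{};
      aux.input_shape[0] = 2;  aux.input_shape[1] = 1;
      aux.output_shape[0] = 2; aux.output_shape[1] = 3;
      aux.in_stride[0] = 1;    aux.in_stride[1] = 1;
      aux.out_stride[0] = 3;   aux.out_stride[1] = 1;
      float in[2] = {1.f, 2.f};
      float out[6];
      for (int32_t i = 0; i < 6; ++i)  // int32_t indexing, as on GPU
        broadcast_map<int32_t>(i, in, out, aux, 2);
      for (float v : out) std::printf("%g ", v);  // prints: 1 1 1 2 2 2
      std::printf("\n");
      return 0;
    }
    ```
    
    In MXNet itself the launch would go through `mxnet_op::Kernel<...>::Launch` and the body would be marked `MSHADOW_XINLINE`, but the idea is the same: the index type becomes a template parameter chosen per device rather than a compile-time typedef.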




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org