Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/09 00:40:27 UTC

anirudh2290 commented on issue #8219: Broadcasting ops are slow
URL: https://github.com/apache/incubator-mxnet/issues/8219#issuecomment-395926419
 
 
   I have been looking at this issue. The MXNet forward pass for broadcast_add is faster than both PyTorch and TensorFlow. To give some numbers (experiments conducted on my p2.8xlarge setup, testing CPU performance):
   
   For broadcasting a tensor of shape (1,) against a tensor of shape (2**17, 10, 10), forward pass only:
   pytorch: 0.6 seconds
   mxnet: 0.4 seconds
   tensorflow: 2.1 seconds
   
   When we include both the forward and backward passes:
   pytorch:  5.1 seconds
   mxnet: 16 seconds
   tensorflow: 2.2 seconds
   
   So we decided to look at the MXNet backward pass and try out some optimizations. We tried using LaunchEx so that each thread gets a bigger chunk of the workload; that by itself didn't help. The bottleneck in the backward pass is the for loop that each thread runs: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.h#L164
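
   To make the bottleneck concrete, here is a rough Python/NumPy sketch of what that per-thread loop does (a simplified analogue for illustration only; the function name `reduce_sum_naive` and the variable names are mine, not the actual kernel's). For every output element it walks the reduction domain, and on every inner iteration it recomputes the coords (unravel) and the dot product with the strides.

   ```
   import numpy as np

   def reduce_sum_naive(big, small_shape):
       # Simplified analogue of the backward broadcast-reduce loop; assumes
       # small_shape has the same number of dims as big.shape, with 1 on the
       # broadcast axes (e.g. (1, 1, 1) against (2**17, 10, 10)).
       big_shape = big.shape
       rshape = tuple(b if s == 1 else 1 for b, s in zip(big_shape, small_shape))
       rstride = tuple(int(np.prod(big_shape[d + 1:])) for d in range(len(big_shape)))
       M = int(np.prod(rshape))
       out = np.zeros(small_shape)
       for idx in range(out.size):                    # one output element at a time
           coord = np.unravel_index(idx, small_shape)
           j = np.ravel_multi_index(coord, big_shape)
           acc = 0.0
           for k in range(M):                         # the hot loop
               rcoord = np.unravel_index(k, rshape)   # coords recomputed every iteration
               acc += big.flat[j + int(np.dot(rcoord, rstride))]  # dot recomputed too
           out.flat[idx] = acc
       return out
   ```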
   
   A lot of work is repeated inside that for loop to compute the coords and the dot product. We tried caching this computation, which improves the MXNet time by more than 2x, to around 7.5 seconds. The drawback is the extra memory the cache requires. You can see the rough implementation here: https://github.com/anirudh2290/mxnet/blob/cached_broadcast/src/operator/tensor/broadcast_reduce-inl.h#L233
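
   In terms of the sketch above, the cached variant boils down to hoisting that index arithmetic out of the hot loop: compute the M flat offsets once and reuse them for every output element. The offset table is where the extra memory goes. (Again just an illustrative sketch, not the actual patch linked above.)

   ```
   import numpy as np

   def reduce_sum_cached(big, small_shape):
       # Same reduction as above, but the unravel/dot work is done once up front.
       big_shape = big.shape
       rshape = tuple(b if s == 1 else 1 for b, s in zip(big_shape, small_shape))
       rstride = tuple(int(np.prod(big_shape[d + 1:])) for d in range(len(big_shape)))
       M = int(np.prod(rshape))
       # Cache: flat offset into `big` for every point of the reduction domain.
       # This O(M) table is the extra memory mentioned above.
       offsets = [int(np.dot(np.unravel_index(k, rshape), rstride)) for k in range(M)]
       out = np.zeros(small_shape)
       for idx in range(out.size):
           coord = np.unravel_index(idx, small_shape)
           j = np.ravel_multi_index(coord, big_shape)
           acc = 0.0
           for k in range(M):
               acc += big.flat[j + offsets[k]]        # no index math in the hot loop
           out.flat[idx] = acc
       return out
   ```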
   
   We suspect (not yet confirmed) that TensorFlow uses the Eigen library for the reduce. We should run experiments with Eigen and consider it as a replacement as we deprecate mshadow.
   
   Next steps:
   1. Introduce the cached implementation to MXNet as a stopgap solution.
   2. Investigate the Eigen library and whether it would be worthwhile to add it as an MXNet dependency.
   
   Here is the benchmark script. Please let me know if there are any caveats that I have missed:
   
   ```
   import numpy as np
   import mxnet as mx
   import time
   
   a = mx.sym.var('a')
   b = mx.sym.var('b')
   
   a_ = mx.nd.ones((2**17, 10, 10))
   b_ = mx.nd.ones((1,))
   
   func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), args={'a': a_, 'b': b_},
                                           args_grad={'a': mx.nd.ones((2**17, 10, 10)),
                                                      'b': mx.nd.ones((1,))})
   
   for _ in range(4):
       # broadcast_add(array, array), forward + backward
       start = time.time()
       for i in range(100):
           out = func2.forward(is_train=True)[0]
           func2.backward(mx.nd.ones((2**17, 10, 10)))
       mx.nd.waitall()
       print("mxnet time taken is: {}".format(time.time() - start))
   
   import torch
   import time
   
   for i in range(4):
   
       start = time.time()
       for j in range(100):
           x = torch.ones((2**17, 10, 10), requires_grad=True)
           y = torch.ones((1,), requires_grad=True)
           z = x + y
           z.backward(torch.ones((2**17, 10, 10)), retain_graph=True)
       print("torch time taken is {}".format(time.time() - start))
   
   import tensorflow as tf
   import time
   a = tf.ones([2**17, 10, 10], name='a')
   b = tf.ones([1], name='b')
   add_op = a + b
   g = tf.gradients(add_op, [a,b])
   
   for x in range(4):
       with tf.Session() as session:
           start = time.time()
           for i in range(100):  # 100 iterations, to match the MXNet and PyTorch loops
               grad_vals = session.run(g)
           print("tf time taken is: {}".format(time.time() - start))
   
   ```
   
   @piiswrong @andreaolgiati @srochel 
