Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/09 00:40:27 UTC
[GitHub] anirudh2290 commented on issue #8219: Broadcasting ops are slow
URL: https://github.com/apache/incubator-mxnet/issues/8219#issuecomment-395926419
I have been looking at this issue. MXNet's forward pass for broadcast_add is much faster than TensorFlow's and PyTorch's. To give some numbers (experiments conducted on my p2.8xlarge setup, testing CPU performance):
For broadcasting a tensor of shape (1,) to a tensor of shape (2**17, 10, 10), forward pass only:
pytorch: 0.6 seconds
mxnet: 0.4 seconds
tensorflow: 2.1 seconds
When we include both forward and backward passes:
pytorch: 5.1 seconds
mxnet: 16 seconds
tensorflow: 2.2 seconds
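The asymmetry between forward and backward makes sense once you look at what backward has to do: the forward pass just reads the same scalar for every output element, while the gradient w.r.t. the (1,)-shaped input has to reduce the upstream gradient over every broadcast axis. A NumPy sketch (with a smaller N than the benchmark, purely for illustration) of what each framework computes:

```python
import numpy as np

N = 2**10  # smaller than the benchmark's 2**17, for illustration
a = np.ones((N, 10, 10))
b = np.ones((1,))

# Forward: cheap -- every output element reads the same scalar from b.
out = a + b

# Backward: grad w.r.t. a keeps the full shape (no reduction), but grad
# w.r.t. b must sum the upstream gradient over all N*10*10 elements.
# That full reduction is what makes the backward pass the expensive part.
ograd = np.ones_like(out)
grad_a = ograd
grad_b = np.array([ograd.sum()])
```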
So we decided to look at the MXNet backward pass and try some optimizations. We tried using LaunchEx so that each thread gets a bigger chunk of the workload, but this by itself doesn't help. The bottleneck in the backward pass is the for loop that runs in each thread: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.h#L164
That loop repeats a lot of work computing coords and dot for every element. We tried caching these computations, which speeds up MXNet's backward pass by more than 2x, to around 7.5 seconds. The drawback is the extra memory the cache requires. You can see the rough implementation here: https://github.com/anirudh2290/mxnet/blob/cached_broadcast/src/operator/tensor/broadcast_reduce-inl.h#L233
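To make the caching idea concrete, here is a rough Python sketch (the real code is C++ in broadcast_reduce-inl.h; the helper names `unravel` and `dot` below are hypothetical and only mirror the mshadow-style helpers):

```python
def unravel(idx, shape):
    # Convert a flat output index into per-axis coordinates.
    coord = []
    for dim in reversed(shape):
        coord.append(idx % dim)
        idx //= dim
    return tuple(reversed(coord))

def dot(coord, stride):
    # Flatten coordinates back into an offset into the strided input.
    return sum(c * s for c, s in zip(coord, stride))

def reduce_sum_naive(big, shape, stride, m):
    # Today's loop: unravel() and dot() (chains of integer div/mod)
    # are recomputed for every one of the M elements being reduced.
    acc = 0.0
    for k in range(m):
        acc += big[dot(unravel(k, shape), stride)]
    return acc

def reduce_sum_cached(big, offsets):
    # Cached variant: the offsets table is built once up front (the extra
    # memory cost), leaving a plain gather + accumulate in the hot loop.
    acc = 0.0
    for off in offsets:
        acc += big[off]
    return acc

shape, stride, m = (2, 3), (3, 1), 6
big = list(range(6))
offsets = [dot(unravel(k, shape), stride) for k in range(m)]  # built once
assert reduce_sum_naive(big, shape, stride, m) == reduce_sum_cached(big, offsets)
```

The trade-off is exactly the one described above: O(M) extra memory for the offsets table in exchange for removing the per-element index arithmetic.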
We suspect (not yet confirmed) that TensorFlow uses the Eigen library to do the reduce. We should run experiments on this and consider Eigen as a replacement as we deprecate mshadow.
Next steps:
1. Introduce the cached solution to MXNet as a stopgap.
2. Investigate the Eigen library and whether it would be worthwhile to add it as an MXNet dependency.
Here is the benchmark script. Please let me know if there are any caveats I have missed:
```python
import numpy as np
import mxnet as mx
import time

a = mx.sym.var('a')
b = mx.sym.var('b')
a_ = mx.nd.ones((2**17, 10, 10))
b_ = mx.nd.ones((1,))
func2 = mx.sym.broadcast_add(a, b).bind(
    mx.cpu(), args={'a': a_, 'b': b_},
    args_grad={'a': mx.nd.ones((2**17, 10, 10)), 'b': mx.nd.ones((1,))})
for _ in range(4):
    # broadcast_add(array, array)
    start = time.time()
    for i in range(100):
        out = func2.forward(is_train=True)[0]
        func2.backward(mx.nd.ones((2**17, 10, 10)))
    mx.nd.waitall()
    print("mxnet time taken is: {}".format(time.time() - start))

import torch

for i in range(4):
    start = time.time()
    for j in range(100):
        x = torch.ones((2**17, 10, 10), requires_grad=True)
        y = torch.ones((1,), requires_grad=True)
        z = x + y
        z.backward(torch.ones((2**17, 10, 10)), retain_graph=True)
    print("torch time taken is {}".format(time.time() - start))

import tensorflow as tf

a = tf.ones([2**17, 10, 10], name='a')
b = tf.ones([1], name='b')
add_op = a + b
g = tf.gradients(add_op, [a, b])
for x in range(4):
    with tf.Session() as session:
        start = time.time()
        for i in range(100):  # 100 iterations, matching the mxnet/torch loops
            grad_vals = session.run(g)
        print("tf time taken is: {}".format(time.time() - start))
```
@piiswrong @andreaolgiati @srochel