Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/06 12:13:10 UTC

[GitHub] aseyboldt commented on issue #8219: Broadcasting ops are slow

URL: https://github.com/apache/incubator-mxnet/issues/8219#issuecomment-349621811
 
 
   @cjolivier01 Thanks for looking into this!
   
   I haven't updated to mxnet 1.0 yet, so it is possible that this has been fixed (I only have a slow internet connection at the moment, so I can't update). Looking at the code, however, I don't think it has.
   
   A broadcasting `broadcast_add(array, array)` shouldn't be much slower than a plain elemwise `array + array`, especially if one of the arrays is smaller than the other, since the smaller input reduces the memory traffic. Memory bandwidth should be the limiting factor for simple ops on large arrays. This can be seen by comparing with numpy:
   
   ```python
   import os
   os.environ['OMP_NUM_THREADS'] = '1'
   
   import numpy as np
   import mxnet as mx
   import time
   
   a = mx.sym.var('a')
   b = mx.sym.var('b')
   
   a_ = mx.nd.ones((2**17, 10, 10))
   b_ = mx.nd.ones((1,))
   c_ = a_.copy()
   
   x = a_.asnumpy()
   y = b_.asnumpy()
   z = c_.asnumpy()
   
   func1 = (a + b).bind(mx.cpu(), {'a': a_, 'b': c_})
   func2 = mx.sym.broadcast_add(a, b).bind(mx.cpu(), {'a': a_, 'b': b_})
   
   for _ in range(2):
       # elemwise
       start = time.time()
       for i in range(100):
           func1.forward()[0].wait_to_read()
       print("func1: {}".format(time.time() - start))
   
   
    # broadcast_add(array, array)
       start = time.time()
       for i in range(100):
           func2.forward()[0].wait_to_read()
       print("func2: {}".format(time.time() - start))
   
       # numpy elemwise
       start = time.time()
       out = np.zeros_like(x)
       for i in range(100):
           np.add(x, z, out=out)
       print("numpy1: {}".format(time.time() - start))
       
       # numpy broadcast
       start = time.time()
       for i in range(100):
           np.add(x, y, out=out)
       print("numpy2: {}".format(time.time() - start))
       
       print()
   ```
   
   which gives me (on a different machine than the last benchmark):
   ```
   func1: 0.9796142578125
   func2: 9.832738876342773
   numpy1: 0.9367139339447021
   numpy2: 0.6408178806304932
   
   func1: 0.927008867263794
   func2: 10.026437997817993
   numpy1: 1.091845989227295
   numpy2: 0.646554708480835
   ```
   
   For numpy the broadcasting op is *faster* than the elemwise one; for mxnet it is about 10x slower.
   
   In the non-broadcasting case both numpy and mxnet are bound by memory bandwidth. That is still more or less true for numpy in the broadcasting case, but not for mxnet. This seems to happen for broadcasting ops in mxnet in general, not only when a scalar is added. (Although numpy can't saturate the memory bandwidth in some cases either, it never slows down nearly as much as mxnet.)
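   To make the bandwidth argument concrete, here is a back-of-the-envelope estimate derived from the timings above. It assumes float32 arrays (the mxnet default, which `.asnumpy()` preserves) and that each op streams its inputs and output through memory exactly once; the byte counts are my own rough accounting, not measured traffic:

    ```python
    # Rough effective-bandwidth estimate for the benchmark above.
    # Assumes float32 (4 bytes/element) and one full pass per array.
    n_elem = 2**17 * 10 * 10   # elements in the large array
    bytes_per_elem = 4
    iters = 100

    # elemwise add: read a, read c, write out
    elemwise_bytes = 3 * n_elem * bytes_per_elem * iters
    # broadcast with a scalar: read a, write out (b is negligible)
    broadcast_bytes = 2 * n_elem * bytes_per_elem * iters

    def gb_per_s(total_bytes, seconds):
        return total_bytes / seconds / 1e9

    print(gb_per_s(elemwise_bytes, 0.94))   # numpy1: ~16.7 GB/s
    print(gb_per_s(broadcast_bytes, 0.64))  # numpy2: ~16.4 GB/s
    print(gb_per_s(broadcast_bytes, 9.8))   # func2:  ~1.1 GB/s
    ```

   Under these assumptions numpy sustains roughly the same bandwidth in both cases (which is why the broadcast version, moving fewer bytes, finishes sooner), while the mxnet broadcast kernel runs an order of magnitude below the memory-bandwidth ceiling, so it must be compute- or latency-bound rather than bandwidth-bound.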
   
   My guess as to why func2 is so much slower than func1 is that the index juggling in `ravel` and `unravel` takes time and defeats prefetching. Other explanations could be that some array is traversed in the wrong order (though I don't think that is the case), or that the branch introduced for `addto` slows things down (but I don't see how that would account for a factor of 10).
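   To illustrate what that index juggling costs, here is a minimal sketch of the kind of per-element work a generic broadcast kernel does. The function names mirror `ravel`/`unravel` but the code is illustrative, not mxnet's actual implementation; `strides_mask` is a hypothetical name for "which axes are broadcast":

    ```python
    # Illustrative sketch of per-element broadcast indexing (not mxnet's code).
    def unravel(idx, shape):
        # Flat output index -> per-axis coordinates: one modulo and one
        # integer division per axis, for every element.
        coords = []
        for dim in reversed(shape):
            coords.append(idx % dim)
            idx //= dim
        return tuple(reversed(coords))

    def ravel(coords, shape, strides_mask):
        # Coordinates -> flat input index, pinning broadcast axes to 0.
        idx = 0
        for c, dim, keep in zip(coords, shape, strides_mask):
            idx = idx * dim + (c if keep else 0)
        return idx

    shape = (4, 3)
    coords = unravel(7, shape)                 # -> (2, 1)
    b_idx = ravel(coords, shape, (False, True))  # axis 0 broadcast -> 1
    ```

   An elemwise kernel just walks a flat pointer linearly, which the hardware prefetcher handles perfectly; paying several integer divisions per element, and producing input addresses the prefetcher can't anticipate, could plausibly dominate an otherwise memory-bound loop.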
