You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/14 05:48:30 UTC

[GitHub] [arrow] qsourav opened a new issue, #14629: [pyarrow] Regarding performance of filter operation

qsourav opened a new issue, #14629:
URL: https://github.com/apache/arrow/issues/14629

   Hi,
   
   It seems like the compute method, filter() has some performance issue, when the number of True values are very small in the input mask in comparison to the number of false values. It is even slower than pandas dataframe filter operation.
   
   Although filter() is slower, but the take() method seems to have better performance for those cases.
   
   Please consider the following experiemental code:
   ---
       import time
       import math
       import numpy as np
       
       import pyarrow as pa
       parr = pa.array(range(6161464))
       
       def get_pos(N, per):
           tot = math.ceil(N * per)
           step = N // tot
           ret = []
           for i in range(0, N, step):
               ret.append(i)
           return ret
       
       def eval(N):
           pers = [i * 0.003 for i in range(1, 30)]
           for i in pers:
               per = round(i * 100, 2)
               indx = get_pos(N, i)
       
               stime = time.time()
               ret = parr.take(indx)
               print("[per: {}%] pyarrow take time: {} sec".format(per, time.time() - stime))
   
               mask = np.asarray([False] * N)
               mask[indx] = True
               pa_mask = pa.array(mask)
               stime = time.time()
               ret = parr.filter(pa_mask)
               print("[per: {}%] pyarrow filter time: {} sec".format(per, time.time() - stime))
   
       eval(6161464)
   
   
   The take() method is significantly faster for the distribution of True values less than 5.1% in the input mask. 
   Filter starts showing better performance when the "True"-distribution exceeds 5.1%.
   
   Filter is a very frequently used method and having lesser True values is somewhat very common in data preprocessing.
   Thus it would be really great if the perfiormance issue of Filter() can be checked and improved. 
   
   Thanks,
   Sourav 
   ![take_vs_filter](https://user-images.githubusercontent.com/61731176/201584766-46951e69-42d7-4ec3-bac2-3ea19e8804c5.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org