Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/06/10 14:30:01 UTC

[jira] [Commented] (ARROW-10423) [C++] Filter compute function seems slow compared to numpy nonzero + take

    [ https://issues.apache.org/jira/browse/ARROW-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360969#comment-17360969 ] 

Antoine Pitrou commented on ARROW-10423:
----------------------------------------

Here is the current status. It seems Arrow is now much faster than NumPy at filtering:
{code:python}
>>> %timeit arr.filter(mask1_pa)
20.4 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit arr.to_numpy()[mask1]
96.8 µs ± 605 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit arr.filter(mask2_pa)
134 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit arr.to_numpy()[mask2]
284 µs ± 545 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
{code}

And it is roughly as fast at taking:
{code:python}
>>> %timeit indices = np.nonzero(mask1)[0]; arr.take(indices)
159 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit indices = np.nonzero(mask1)[0]; arr.to_numpy()[indices]
142 µs ± 168 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit indices = np.nonzero(mask2)[0]; arr.take(indices)
274 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit indices = np.nonzero(mask2)[0]; arr.to_numpy()[indices]
285 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
{code}
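The two strategies being timed above can be cross-checked for equivalence with a small sketch (a hypothetical reproduction; array sizes are shrunk from the issue's 1,000,000 elements, and a fixed seed replaces the issue's unseeded randomness):
{code:python}
import numpy as np
import pyarrow as pa

# Deterministic stand-in for the issue's setup.
rng = np.random.default_rng(42)
values = rng.standard_normal(1_000)
arr = pa.array(values)

# Sparse boolean mask, analogous to mask1 in the issue.
mask = np.zeros(len(arr), dtype=bool)
mask[rng.integers(len(arr), size=10)] = True
mask_pa = pa.array(mask)

# Strategy 1: the Filter kernel applied directly to the boolean mask.
filtered = arr.filter(mask_pa)

# Strategy 2: convert the mask to indices with numpy, then use Take.
indices = np.nonzero(mask)[0]
taken = arr.take(pa.array(indices))

# Both select exactly the masked values, in order.
assert filtered.equals(taken)
assert np.array_equal(filtered.to_numpy(), values[mask])
{code}
Since both paths produce identical results, the choice between them is purely a performance question, which is what the timings above measure.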


> [C++] Filter compute function seems slow compared to numpy nonzero + take
> -------------------------------------------------------------------------
>
>                 Key: ARROW-10423
>                 URL: https://issues.apache.org/jira/browse/ARROW-10423
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: compute
>
> From https://stackoverflow.com/questions/64581590/is-there-a-more-efficient-way-to-select-rows-from-a-pyarrow-table-based-on-conte
> I made a smaller, simplified example:
> {code:python}
> arr = pa.array(np.random.randn(1_000_000))
> # mask with only few True values
> mask1 = np.zeros(len(arr), dtype=bool)
> mask1[np.random.randint(len(arr), size=100)] = True
> mask1_pa = pa.array(mask1)
> # mask with larger proportion of True values
> mask2 = np.zeros(len(arr), dtype=bool)
> mask2[np.random.randint(len(arr), size=10_000)] = True
> mask2_pa = pa.array(mask2)
> {code}
> Timings of the Arrow {{Filter}} kernel vs using numpy to convert the mask into indices and then applying a {{Take}} kernel:
> {code}
> # mask 1
> In [3]: %timeit arr.filter(mask1_pa)
> 132 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> In [4]: %%timeit
>    ...: indices = np.nonzero(mask1)[0]
>    ...: arr.take(indices)
> 114 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> # mask 2
> In [8]: %timeit arr.filter(mask2_pa)
> 711 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [9]: %%timeit
>    ...: indices = np.nonzero(mask2)[0]
>    ...: arr.take(indices)
> 333 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> So in the first case, both are quite similar in timing. But in the second case, the numpy+take version is faster.
> I know this might depend a lot on the actual proportion of True values and how they are positioned in the array (random vs concentrated), so there is probably no general rule about which should be faster.
> Still, it seems a potential indication that the Filter kernel can be optimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)