Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2021/04/06 08:55:00 UTC

[jira] [Commented] (ARROW-11568) [C++][Compute] Mode kernel performance is bad in some conditions

    [ https://issues.apache.org/jira/browse/ARROW-11568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315355#comment-17315355 ] 

Yibo Cai commented on ARROW-11568:
----------------------------------

scipy.stats.mode calls numpy.unique to do the work.
numpy.unique sorts the array and counts runs of equal values. This looks like a better approach than Arrow's, which stores value counts in a map.
Both use O(n) space. Arrow's map approach only outperforms numpy when there are many duplicated values (> 100 occurrences of each value), which does not look like a useful case.
I think numpy's sort-and-count approach is better.
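
For illustration only, here is a minimal Python sketch of the two strategies compared above. It is not Arrow's C++ kernel or scipy's implementation. The function names mode_sort_count and mode_hash_count are hypothetical, and the tie-breaking rule (the smallest value wins) is an assumption chosen so that both variants agree.

```python
from collections import Counter

import numpy as np


def mode_sort_count(values):
    """Sort-and-count: np.unique sorts the input and counts equal values."""
    uniques, counts = np.unique(values, return_counts=True)
    # argmax returns the first maximum; uniques is sorted, so ties
    # resolve to the smallest value (assumed tie-breaking rule).
    idx = int(np.argmax(counts))
    return uniques[idx], int(counts[idx])


def mode_hash_count(values):
    """Map-based counting: one hash-map entry per distinct value."""
    counts = Counter(np.asarray(values).tolist())
    # Highest count first; ties resolve to the smallest value.
    value, count = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return value, count


if __name__ == "__main__":
    data = np.random.randint(0, 1000, 100_000)
    # Both strategies should report the same mode and count.
    print(mode_sort_count(data), mode_hash_count(data))
```

Both functions need O(n) extra space in the worst case (all values distinct). The map variant does one hash lookup per element, while the sort variant pays for a sort followed by a linear scan.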

> [C++][Compute] Mode kernel performance is bad in some conditions
> ----------------------------------------------------------------
>
>                 Key: ARROW-11568
>                 URL: https://issues.apache.org/jira/browse/ARROW-11568
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>
> Compared with scipy.stats.mode, the Arrow mode kernel is much slower in some conditions. See the example below.
> {noformat}
> In [1]: import numpy as np
> In [2]: import scipy.stats
> In [3]: import pyarrow.compute as pc
> In [4]: f = np.random.rand(12345678)
> In [5]: time scipy.stats.mode(f)
> CPU times: user 1.14 s, sys: 111 ms, total: 1.25 s
> Wall time: 1.25 s
> Out[5]: ModeResult(mode=array([2.25710692e-08]), count=array([1]))
> In [6]: time pc.mode(f)[0]
> CPU times: user 8.44 s, sys: 338 ms, total: 8.77 s
> Wall time: 8.77 s
> Out[6]: <pyarrow.StructScalar: {'mode': 2.2571069235866048e-08, 'count': 1}>
> In [7]: i = np.random.randint(0, 1234567, 12345678)
> In [8]: time scipy.stats.mode(i)
> CPU times: user 1.03 s, sys: 3.11 ms, total: 1.03 s
> Wall time: 1.03 s
> Out[8]: ModeResult(mode=array([607002]), count=array([28]))
> In [9]: time pc.mode(i)[0]
> CPU times: user 1.57 s, sys: 0 ns, total: 1.57 s
> Wall time: 1.57 s
> Out[9]: <pyarrow.StructScalar: {'mode': 607002, 'count': 28}>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)