You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/10 02:12:08 UTC
[GitHub] [arrow] cyb70289 commented on pull request #8524: ARROW-10345: [C++][Compute] Fix NaN error in sorting and topn kernels
cyb70289 commented on pull request #8524:
URL: https://github.com/apache/arrow/pull/8524#issuecomment-724403787
> Is "sort_indices" actually stable? The [documentation](https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions) says "non-stable".
In current code, "sort_indices" is stable, it uses "std::stable_sort". "partition_nth_indices" is not stable, it uses "std::nth_element".
[numpy.argsort](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) has a parameter to select sorting algorithm, maybe arrow can also implement unstable quicksort (std::sort) and provide a kernel option to select.
```
# comparison of numpy stable and unstable sort
# quicksort gives better performance
In [1]: import numpy as np
In [2]: a = np.random.randint(0, 100, 12345678)
In [3]: time i1 = np.argsort(a, kind='quicksort')
CPU times: user 909 ms, sys: 4.67 ms, total: 914 ms
Wall time: 913 ms
In [4]: time i2 = np.argsort(a, kind='stable')
CPU times: user 998 ms, sys: 20.8 ms, total: 1.02 s
Wall time: 1.02 s
# stable and quick sort gives different result
In [5]: (i1 == i2).all()
Out[5]: False
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org