You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Kirill Lykov (Jira)" <ji...@apache.org> on 2021/06/09 11:25:00 UTC

[jira] [Comment Edited] (ARROW-10899) [C++] Investigate radix sort for integer arrays

    [ https://issues.apache.org/jira/browse/ARROW-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359990#comment-17359990 ] 

Kirill Lykov edited comment on ARROW-10899 at 6/9/21, 11:24 AM:
----------------------------------------------------------------

I have to drop this ticket but if someone will take it, I will be glad to assist.
Below is a summary of the research I did.

First of all I couldn't find ready to use stable radix sort implementation which is fast and generic.
boost::spread_sort is the best available implementation of in-place radix sort because it designed to perform well on all the inputs.
On the contrary, vanilla radix sort will perform poorly on some types of distributions as shown on the plot below.
spread_sort performs better for two reasons:
* it detect some cases when radix sort will perform poorly and relies on boost's pdqsort sorting algorithm
* it uses adaptive size of the radix rather than fixed

What makes spreadsort unstable is that it sometimes relies on pdqsort which is unstable and that it is in-place sort.
Hence, it is possible to make it stable by replacing calls to pdqsort to stable_sort and by using additional buffer instead of doing in-place sorting.

!differentdistrib.png|width=350,height=350!


was (Author: klykov):
I have to drop this ticket but if someone will take it, I will be glad to assist.
Below is a summary of the research I did.

First of all I couldn't find ready to use stable radix sort implementation which is fast and generic.
boost::spread_sort is the best available implementation of in-place radix sort because it designed to perform well on all the inputs.
On the contrary, vanilla radix sort will perform poorly on some types of distributions as shown on the plot below.
spread_sort performs better for two reasons:
* it detect some cases when radix sort will perform poorly and relies on boost's pdqsort sorting algorithm
* it uses adaptive size of the radix rather than fixed

What makes spreadsort unstable is that it sometimes relies on pdqsort which is unstable and that it is in-place sort.
Hence, it is possible to make it stable by replacing calls to pdqsort to stable_sort and by using additional buffer instead of doing in-place sorting.

!differentdistrib.png!

> [C++] Investigate radix sort for integer arrays
> -----------------------------------------------
>
>                 Key: ARROW-10899
>                 URL: https://issues.apache.org/jira/browse/ARROW-10899
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Kirill Lykov
>            Priority: Major
>         Attachments: Screen Shot 2021-02-09 at 17.48.13.png, Screen Shot 2021-02-10 at 10.58.23.png, all_random_wholeRange.png, all_random_wholeRange.png, all_random_wholeRange.png, differentdistrib.png
>
>
> For integer arrays with a non-tiny range of values, we currently use a stable sort. It may be faster to use a radix sort instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)