You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/10/30 17:54:01 UTC

[jira] [Updated] (IMPALA-3357) Sorter's quicksort implementation is very suboptimal for duplicate keys

     [ https://issues.apache.org/jira/browse/IMPALA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-3357:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Sorter's quicksort implementation is very suboptimal for duplicate keys
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-3357
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3357
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.6.0
>            Reporter: Tim Armstrong
>            Priority: Minor
>              Labels: perf, ramp-up
>
> {code}
> void Sorter::TupleSorter::SortHelper(TupleIterator first, TupleIterator last) {
>   if (UNLIKELY(state_->is_cancelled())) return;
>   // Use insertion sort for smaller sequences.
>   while (last.index_ - first.index_ > INSERTION_THRESHOLD) {
>     TupleIterator iter(this, first.index_ + (last.index_ - first.index_) / 2);
>     DCHECK(iter.current_tuple_ != NULL);
>     // Partition() splits the tuples in [first, last) into two groups (<= pivot
>     // and >= pivot) in-place. 'cut' is the index of the first tuple in the second group.
>     TupleIterator cut = Partition(first, last,
>         reinterpret_cast<Tuple*>(iter.current_tuple_));
>     SortHelper(cut, last);
>     last = cut;
>     if (UNLIKELY(state_->is_cancelled())) return;
>   }
>   InsertionSort(first, last);
> }
> {code}
> The quicksort implementation in the sorter is based on dividing the input into two partitions: <= pivot and >= pivot.
> If all of the input values in a partition are equal, then it will still recursively divide and do insertion sort on the values. We could change the sorter to partition the input into three partitions: <, == and >. Then it doesn't need to recurse on the middle partition. This would mean it could sort a partition full of duplicate values in a single pass over the input.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org