You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/10/30 17:54:01 UTC
[jira] [Updated] (IMPALA-3357) Sorter's quicksort implementation is
very suboptimal for duplicate keys
[ https://issues.apache.org/jira/browse/IMPALA-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-3357:
----------------------------------
Issue Type: Improvement (was: Bug)
> Sorter's quicksort implementation is very suboptimal for duplicate keys
> -----------------------------------------------------------------------
>
> Key: IMPALA-3357
> URL: https://issues.apache.org/jira/browse/IMPALA-3357
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 2.6.0
> Reporter: Tim Armstrong
> Priority: Minor
> Labels: perf, ramp-up
>
> {code}
> void Sorter::TupleSorter::SortHelper(TupleIterator first, TupleIterator last) {
> if (UNLIKELY(state_->is_cancelled())) return;
> // Use insertion sort for smaller sequences.
> while (last.index_ - first.index_ > INSERTION_THRESHOLD) {
> TupleIterator iter(this, first.index_ + (last.index_ - first.index_) / 2);
> DCHECK(iter.current_tuple_ != NULL);
> // Partition() splits the tuples in [first, last) into two groups (<= pivot
> // and >= pivot) in-place. 'cut' is the index of the first tuple in the second group.
> TupleIterator cut = Partition(first, last,
> reinterpret_cast<Tuple*>(iter.current_tuple_));
> SortHelper(cut, last);
> last = cut;
> if (UNLIKELY(state_->is_cancelled())) return;
> }
> InsertionSort(first, last);
> }
> {code}
> The quicksort implementation in the sorter is based on dividing the input into two partitions: <= pivot and >= pivot.
> If all of the input values in a partition are equal, then it will still recursively divide and do insertion sort on the values. We could change the sorter to partition the input into three partitions: <, == and >. Then it doesn't need to recurse on the middle partition. This would mean it could sort a partition full of duplicate values in a single pass over the input.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org