You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/09/09 00:59:00 UTC

[jira] [Resolved] (ARROW-9895) [RUST] Improve sort kernels

     [ https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove resolved ARROW-9895.
-------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8092
[https://github.com/apache/arrow/pull/8092]

> [RUST] Improve sort kernels
> ---------------------------
>
>                 Key: ARROW-9895
>                 URL: https://issues.apache.org/jira/browse/ARROW-9895
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 1.0.0
>            Reporter: Jörn Horstmann
>            Assignee: Jörn Horstmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Followup from my mailing list post:
> {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treats NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly to
> options
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
> {quote}
> My plan is to open a draft PR with the following changes:
>  - {{sort_to_indices}} further splits up float64/float32 inputs into nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices according to the sort options. Nans are distinct from null and sort greater than any other valid value
> - implement a sort method for dictionary arrays with string values. this kernel checks the {{is_ordered}} flag and sorts just by the keys if it is set, it will look up the string values otherwise
> - for the lexical sort use case the above kernel are not used, instead the {{OrdArray}} trait is used. To make that more flexible and allow wrapping arrays with differend ordering behavior I will make it no longer extend {{Array}} and instead only contain the {{cmp_value}} method
> - string dictionary sorting can then be implemented with a wrapper struct {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
> - NaN aware sorting of floats can also be implemented with a wrapper struct and trait implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)