You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jörn Horstmann (Jira)" <ji...@apache.org> on 2020/09/01 17:14:00 UTC
[jira] [Created] (ARROW-9895) [RUST] Improve sort kernels

Jörn Horstmann created ARROW-9895:
-------------------------------------

             Summary: [RUST] Improve sort kernels
                 Key: ARROW-9895
                 URL: https://issues.apache.org/jira/browse/ARROW-9895
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
    Affects Versions: 1.0.0
            Reporter: Jörn Horstmann


Followup from my mailing list post:
{quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32
and Float64 data types are not supported because the implementation
relies on the OrdArray trait. This trait is not implemented because
f64/f32 only implements PartialOrd. The sort function for a single
column (sort_to_indices) has some special logic which looks like it
wants to treats NaN the same as null, but I'm also not convinced this
is the correct way. For example postgres does the following
(https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)

"In order to allow floating-point values to be sorted and used in
tree-based indexes, PostgreSQL treats NaN values as equal, and greater
than all non-NaN values."

I propose to do the same in an OrdArray impl for
Float64Array/Float32Array and then simplifying the sort_to_indices
function accordingly.

2. Sorting for dictionary encoded strings. The problem here is that
DictionaryArray does not have a generic parameter for the value type
so it is not currently possible to only implement OrdArray for string
dictionaries. Again for the single column case, the value data type
could be checked and a sort could be implemented by looking up each
key in the dictionary. An optimization could be to check the is_sorted
flag of DictionaryArray (which does not seem to be used really) and
then directly sort by the keys. For the general case I see roughly to
options

- Somehow implement an OrdArray view of the dictionary array. This
could be easier if OrdArray did not extend Array but was a completely
separate trait.
- Change the lexicographic sort impl to not use dynamic calls but
instead sort multiple times. So for a query `ORDER BY a, b`, first
sort by b and afterwards sort again by a. With a stable sort
implementation this should result in the same ordering. I'm curious
about the performance, it could avoid dynamic method calls for each
comparison, but it would process the indices vector multiple times.
{quote}

My plan is to open a draft PR with the following changes:

 - {{sort_to_indices}} further splits up float64/float32 inputs into nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices according to the sort options. Nans are distinct from null and sort greater than any other valid value
- implement a sort method for dictionary arrays with string values. this kernel checks the {{is_ordered}} flag and sorts just by the keys if it is set, it will look up the string values otherwise
- for the lexical sort use case the above kernel are not used, instead the {{OrdArray}} trait is used. To make that more flexible and allow wrapping arrays with differend ordering behavior I will make it no longer extend {{Array}} and instead only contain the {{cmp_value}} method
- string dictionary sorting can then be implemented with a wrapper struct {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}}
- NaN aware sorting of floats can also be implemented with a wrapper struct and trait implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)