You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Jörn Horstmann <jo...@signavio.com> on 2020/08/28 14:33:41 UTC

Rust/Datafusion sort kernel issues

I ran into a few issues with the rust sort kernels and would like to
discuss possible solutions.

1. When sorting by multiple columns (lexsort_to_indices) the Float32
and Float64 data types are not supported because the implementation
relies on the OrdArray trait. This trait is not implemented because
f64/f32 only implements PartialOrd. The sort function for a single
column (sort_to_indices) has some special logic which looks like it
wants to treats NaN the same as null, but I'm also not convinced this
is the correct way. For example postgres does the following
(https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)

"In order to allow floating-point values to be sorted and used in
tree-based indexes, PostgreSQL treats NaN values as equal, and greater
than all non-NaN values."

I propose to do the same in an OrdArray impl for
Float64Array/Float32Array and then simplifying the sort_to_indices
function accordingly.

2. Sorting for dictionary encoded strings. The problem here is that
DictionaryArray does not have a generic parameter for the value type
so it is not currently possible to only implement OrdArray for string
dictionaries. Again for the single column case, the value data type
could be checked and a sort could be implemented by looking up each
key in the dictionary. An optimization could be to check the is_sorted
flag of DictionaryArray (which does not seem to be used really) and
then directly sort by the keys. For the general case I see roughly to
options

- Somehow implement an OrdArray view of the dictionary array. This
could be easier if OrdArray did not extend Array but was a completely
separate trait.
- Change the lexicographic sort impl to not use dynamic calls but
instead sort multiple times. So for a query `ORDER BY a, b`, first
sort by b and afterwards sort again by a. With a stable sort
implementation this should result in the same ordering. I'm curious
about the performance, it could avoid dynamic method calls for each
comparison, but it would process the indices vector multiple times.

Looking forward to any other suggestions or feedback.

-- 
Jörn Horstmann | Senior Backend Engineer

www.signavio.com
Kurfürstenstraße 111, 10787 Berlin, Germany

Re: Floating-point order

Posted by Matthias Vallentin <ma...@vallentin.net>.

I agree that users should have the capability to determine their own way,
but with Arrow going more into the direction of providing compute
building blocks (kernels), a choice must be made, e.g., when it comes to
sorting, computing a mean, etc.

Certainly -Inf and Inf are easy to put in the row, but NaN is the odd one
out. At the very left, right, between -0 and 0? There are many
possibilities, and I'm sure one could argue in depth about what makes most
sense. I think as long as the total order is explicit, users can work with
it.

I'm not sure how easy it will be to give the user choice wrt. to the
ordering. If the API allows for specifying a custom comparison function,
then the flexibility is there. Would that be possible with Gandiva?

On Wed, Sep 2, 2020 at 11:50 AM Antoine Pitrou <an...@python.org> wrote:

>
> Well, Inf and -Inf are already ordered ;-)  Nan is, as usual, a can of
> worms.  Ordering probably doesn't belong in the Arrow spec (which is
> only concerned with representing data, not processing it).
>
> In any case, I agree that it makes sense to handle all NaNs as equal
> when implementing comparison-based functions: we're trying to do that on
> the C++ side.  It also makes sense to choose a well-defined ordering for
> them (for example "consider all NaNs smaller than non-NaNs", or larger).
>
> In some cases, you may also want to let the user alter the behaviour.
> For example, when summing an array, you may want a NaN in the input to
> force the result to NaN (as the IEEE spec would say), or you may want
> NaNs to be ignored.  NumPy has two functions (`sum` and `nansum`) for
> these two different behaviours.
>
> Regards
>
> Antoine.
>
>
> Le 02/09/2020 à 11:40, Matthias Vallentin a écrit :
> > Would it perhaps make sense to define the total order for non-numbers
> (NaN,
> > Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
> > behavior is the same across all languages?
> >
> > On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <an...@gmail.com>
> wrote:
> >
> >> Hi Jörn,
> >>
> >> I agree with your concerns about NaN. There was a discussion about this
> in
> >> https://github.com/apache/arrow/pull/7193
> >>
> >> I will try and make time this weekend to look at the current
> implementation
> >> and your suggestions around DictionaryArray.
> >>
> >> Hopefully, other contributors that are more familiar with that code can
> >> respond as well.
> >>
> >> Thanks,
> >>
> >> Andy.
> >>
> >>
> >>
> >> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
> >> joern.horstmann@signavio.com>
> >> wrote:
> >>
> >>> I ran into a few issues with the rust sort kernels and would like to
> >>> discuss possible solutions.
> >>>
> >>> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> >>> and Float64 data types are not supported because the implementation
> >>> relies on the OrdArray trait. This trait is not implemented because
> >>> f64/f32 only implements PartialOrd. The sort function for a single
> >>> column (sort_to_indices) has some special logic which looks like it
> >>> wants to treats NaN the same as null, but I'm also not convinced this
> >>> is the correct way. For example postgres does the following
> >>> (
> https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
> >> )
> >>>
> >>> "In order to allow floating-point values to be sorted and used in
> >>> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> >>> than all non-NaN values."
> >>>
> >>> I propose to do the same in an OrdArray impl for
> >>> Float64Array/Float32Array and then simplifying the sort_to_indices
> >>> function accordingly.
> >>>
> >>> 2. Sorting for dictionary encoded strings. The problem here is that
> >>> DictionaryArray does not have a generic parameter for the value type
> >>> so it is not currently possible to only implement OrdArray for string
> >>> dictionaries. Again for the single column case, the value data type
> >>> could be checked and a sort could be implemented by looking up each
> >>> key in the dictionary. An optimization could be to check the is_sorted
> >>> flag of DictionaryArray (which does not seem to be used really) and
> >>> then directly sort by the keys. For the general case I see roughly to
> >>> options
> >>>
> >>> - Somehow implement an OrdArray view of the dictionary array. This
> >>> could be easier if OrdArray did not extend Array but was a completely
> >>> separate trait.
> >>> - Change the lexicographic sort impl to not use dynamic calls but
> >>> instead sort multiple times. So for a query `ORDER BY a, b`, first
> >>> sort by b and afterwards sort again by a. With a stable sort
> >>> implementation this should result in the same ordering. I'm curious
> >>> about the performance, it could avoid dynamic method calls for each
> >>> comparison, but it would process the indices vector multiple times.
> >>>
> >>> Looking forward to any other suggestions or feedback.
> >>>
> >>> --
> >>> Jörn Horstmann | Senior Backend Engineer
> >>>
> >>> www.signavio.com
> >>> Kurfürstenstraße 111, 10787 Berlin, Germany
> >>>
> >>
> >
>

Re: Floating-point order

Posted by Antoine Pitrou <an...@python.org>.

Well, Inf and -Inf are already ordered ;-)  Nan is, as usual, a can of
worms.  Ordering probably doesn't belong in the Arrow spec (which is
only concerned with representing data, not processing it).

In any case, I agree that it makes sense to handle all NaNs as equal
when implementing comparison-based functions: we're trying to do that on
the C++ side.  It also makes sense to choose a well-defined ordering for
them (for example "consider all NaNs smaller than non-NaNs", or larger).

In some cases, you may also want to let the user alter the behaviour.
For example, when summing an array, you may want a NaN in the input to
force the result to NaN (as the IEEE spec would say), or you may want
NaNs to be ignored.  NumPy has two functions (`sum` and `nansum`) for
these two different behaviours.

Regards

Antoine.


Le 02/09/2020 à 11:40, Matthias Vallentin a écrit :
> Would it perhaps make sense to define the total order for non-numbers (NaN,
> Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
> behavior is the same across all languages?
> 
> On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <an...@gmail.com> wrote:
> 
>> Hi Jörn,
>>
>> I agree with your concerns about NaN. There was a discussion about this in
>> https://github.com/apache/arrow/pull/7193
>>
>> I will try and make time this weekend to look at the current implementation
>> and your suggestions around DictionaryArray.
>>
>> Hopefully, other contributors that are more familiar with that code can
>> respond as well.
>>
>> Thanks,
>>
>> Andy.
>>
>>
>>
>> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
>> joern.horstmann@signavio.com>
>> wrote:
>>
>>> I ran into a few issues with the rust sort kernels and would like to
>>> discuss possible solutions.
>>>
>>> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
>>> and Float64 data types are not supported because the implementation
>>> relies on the OrdArray trait. This trait is not implemented because
>>> f64/f32 only implements PartialOrd. The sort function for a single
>>> column (sort_to_indices) has some special logic which looks like it
>>> wants to treats NaN the same as null, but I'm also not convinced this
>>> is the correct way. For example postgres does the following
>>> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
>> )
>>>
>>> "In order to allow floating-point values to be sorted and used in
>>> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
>>> than all non-NaN values."
>>>
>>> I propose to do the same in an OrdArray impl for
>>> Float64Array/Float32Array and then simplifying the sort_to_indices
>>> function accordingly.
>>>
>>> 2. Sorting for dictionary encoded strings. The problem here is that
>>> DictionaryArray does not have a generic parameter for the value type
>>> so it is not currently possible to only implement OrdArray for string
>>> dictionaries. Again for the single column case, the value data type
>>> could be checked and a sort could be implemented by looking up each
>>> key in the dictionary. An optimization could be to check the is_sorted
>>> flag of DictionaryArray (which does not seem to be used really) and
>>> then directly sort by the keys. For the general case I see roughly to
>>> options
>>>
>>> - Somehow implement an OrdArray view of the dictionary array. This
>>> could be easier if OrdArray did not extend Array but was a completely
>>> separate trait.
>>> - Change the lexicographic sort impl to not use dynamic calls but
>>> instead sort multiple times. So for a query `ORDER BY a, b`, first
>>> sort by b and afterwards sort again by a. With a stable sort
>>> implementation this should result in the same ordering. I'm curious
>>> about the performance, it could avoid dynamic method calls for each
>>> comparison, but it would process the indices vector multiple times.
>>>
>>> Looking forward to any other suggestions or feedback.
>>>
>>> --
>>> Jörn Horstmann | Senior Backend Engineer
>>>
>>> www.signavio.com
>>> Kurfürstenstraße 111, 10787 Berlin, Germany
>>>
>>
>

Re: Rust/Datafusion sort kernel issues

Posted by Matthias Vallentin <ma...@vallentin.net>.

Would it perhaps make sense to define the total order for non-numbers (NaN,
Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
behavior is the same across all languages?

On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <an...@gmail.com> wrote:

> Hi Jörn,
>
> I agree with your concerns about NaN. There was a discussion about this in
> https://github.com/apache/arrow/pull/7193
>
> I will try and make time this weekend to look at the current implementation
> and your suggestions around DictionaryArray.
>
> Hopefully, other contributors that are more familiar with that code can
> respond as well.
>
> Thanks,
>
> Andy.
>
>
>
> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
> joern.horstmann@signavio.com>
> wrote:
>
> > I ran into a few issues with the rust sort kernels and would like to
> > discuss possible solutions.
> >
> > 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> > and Float64 data types are not supported because the implementation
> > relies on the OrdArray trait. This trait is not implemented because
> > f64/f32 only implements PartialOrd. The sort function for a single
> > column (sort_to_indices) has some special logic which looks like it
> > wants to treats NaN the same as null, but I'm also not convinced this
> > is the correct way. For example postgres does the following
> > (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
> )
> >
> > "In order to allow floating-point values to be sorted and used in
> > tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> > than all non-NaN values."
> >
> > I propose to do the same in an OrdArray impl for
> > Float64Array/Float32Array and then simplifying the sort_to_indices
> > function accordingly.
> >
> > 2. Sorting for dictionary encoded strings. The problem here is that
> > DictionaryArray does not have a generic parameter for the value type
> > so it is not currently possible to only implement OrdArray for string
> > dictionaries. Again for the single column case, the value data type
> > could be checked and a sort could be implemented by looking up each
> > key in the dictionary. An optimization could be to check the is_sorted
> > flag of DictionaryArray (which does not seem to be used really) and
> > then directly sort by the keys. For the general case I see roughly to
> > options
> >
> > - Somehow implement an OrdArray view of the dictionary array. This
> > could be easier if OrdArray did not extend Array but was a completely
> > separate trait.
> > - Change the lexicographic sort impl to not use dynamic calls but
> > instead sort multiple times. So for a query `ORDER BY a, b`, first
> > sort by b and afterwards sort again by a. With a stable sort
> > implementation this should result in the same ordering. I'm curious
> > about the performance, it could avoid dynamic method calls for each
> > comparison, but it would process the indices vector multiple times.
> >
> > Looking forward to any other suggestions or feedback.
> >
> > --
> > Jörn Horstmann | Senior Backend Engineer
> >
> > www.signavio.com
> > Kurfürstenstraße 111, 10787 Berlin, Germany
> >
>

Re: Rust/Datafusion sort kernel issues

Posted by Andy Grove <an...@gmail.com>.

Hi Jörn,

I agree with your concerns about NaN. There was a discussion about this in
https://github.com/apache/arrow/pull/7193

I will try and make time this weekend to look at the current implementation
and your suggestions around DictionaryArray.

Hopefully, other contributors that are more familiar with that code can
respond as well.

Thanks,

Andy.



On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <jo...@signavio.com>
wrote:

> I ran into a few issues with the rust sort kernels and would like to
> discuss possible solutions.
>
> 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> and Float64 data types are not supported because the implementation
> relies on the OrdArray trait. This trait is not implemented because
> f64/f32 only implements PartialOrd. The sort function for a single
> column (sort_to_indices) has some special logic which looks like it
> wants to treats NaN the same as null, but I'm also not convinced this
> is the correct way. For example postgres does the following
> (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)
>
> "In order to allow floating-point values to be sorted and used in
> tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> than all non-NaN values."
>
> I propose to do the same in an OrdArray impl for
> Float64Array/Float32Array and then simplifying the sort_to_indices
> function accordingly.
>
> 2. Sorting for dictionary encoded strings. The problem here is that
> DictionaryArray does not have a generic parameter for the value type
> so it is not currently possible to only implement OrdArray for string
> dictionaries. Again for the single column case, the value data type
> could be checked and a sort could be implemented by looking up each
> key in the dictionary. An optimization could be to check the is_sorted
> flag of DictionaryArray (which does not seem to be used really) and
> then directly sort by the keys. For the general case I see roughly to
> options
>
> - Somehow implement an OrdArray view of the dictionary array. This
> could be easier if OrdArray did not extend Array but was a completely
> separate trait.
> - Change the lexicographic sort impl to not use dynamic calls but
> instead sort multiple times. So for a query `ORDER BY a, b`, first
> sort by b and afterwards sort again by a. With a stable sort
> implementation this should result in the same ordering. I'm curious
> about the performance, it could avoid dynamic method calls for each
> comparison, but it would process the indices vector multiple times.
>
> Looking forward to any other suggestions or feedback.
>
> --
> Jörn Horstmann | Senior Backend Engineer
>
> www.signavio.com
> Kurfürstenstraße 111, 10787 Berlin, Germany
>