Posted to dev@arrow.apache.org by Raphael Taylor-Davies <r....@googlemail.com.INVALID> on 2023/01/22 15:12:32 UTC

RunEndEncodedArray Null Counts

Hi,

Apologies if I am rehashing something that has already been discussed or 
is documented elsewhere, but reading the documentation of the Run-Length 
encoding [1] I noticed that the parent null count can be non-zero [2].

This is somewhat surprising to me for a couple of reasons:

- This is inconsistent with how it is handled for other nested types
like dictionaries, structs, etc., where a null count is solely the
number of nulls in the mask of that Array
- Codepaths that use null counts to infer validity mask properties such
as presence, bit counts, etc. will no longer work
- This null count can only be recomputed in the context of the run-ends,
implying codepaths that slice ArrayData or otherwise manipulate
ArrayData directly must be run-length aware
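
To illustrate the last point, here is a minimal sketch in plain Python (not the Arrow API) of what recomputing a logical null count for a slice would involve. The function name `logical_null_count`, the list-based representation, and the use of `None` for a null run value are all assumptions made for illustration.

```python
from bisect import bisect_right


def logical_null_count(run_ends, values, offset=0, length=None):
    """Count logical nulls in the slice [offset, offset + length).

    run_ends[i] is the exclusive logical end of run i and values[i] is
    that run's value; None stands in for a null run (an assumption of
    this sketch, not the Arrow memory layout).
    """
    total = run_ends[-1] if run_ends else 0
    if length is None:
        length = total - offset
    end = offset + length
    if offset < 0 or length < 0 or end > total:
        raise ValueError("slice out of bounds")

    nulls = 0
    # Index of the first run whose end lies beyond the slice start.
    i = bisect_right(run_ends, offset)
    start = offset
    while start < end:
        # Clip the current run to the slice boundary.
        run_end = min(run_ends[i], end)
        if values[i] is None:
            nulls += run_end - start
        start = run_end
        i += 1
    return nulls


run_ends = [3, 5, 9]        # runs cover [0, 3), [3, 5), [5, 9)
values = [None, 7, None]    # two null runs, one run of 7
whole = logical_null_count(run_ends, values)
sliced = logical_null_count(run_ends, values, offset=2, length=4)
```

In this sketch the values list holds only two nulls, while the logical count depends on the run ends and on the slice bounds, so code that slices without consulting the run ends cannot keep such a count up to date.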

This leads to a couple of questions:

- Is this a documentation mistake, or is the null count of RunEndEncoded
ArrayData determined by its children?
- Can a RunEndEncoded ArrayData contain a null mask itself,
independently of its runs, much like dictionary arrays can?

Any clarifications would be most welcome.

[1]: 
https://arrow.apache.org/docs/dev/format/Columnar.html#run-end-encoded-layout
[2]: https://github.com/apache/arrow/pull/13333/files#r1083470362


Re: RunEndEncodedArray Null Counts

Posted by Andrew Lamb <al...@influxdata.com>.
To complete this thread, the documentation has been updated to clarify the
intent[1].

Thank you all very much,
Andrew

[1] https://github.com/apache/arrow/pull/33831


Re: RunEndEncodedArray Null Counts

Posted by Raphael Taylor-Davies <r....@googlemail.com.INVALID>.
Hi Tobias,

Thank you for clarifying this, it makes sense to me.

Kind Regards,

Raphael


RE: RunEndEncodedArray Null Counts

Posted by Tobias Zagorni <to...@zagorni.eu.INVALID>.
Hi Raphael,

I think this is indeed a documentation mistake, it should say 0!

For exactly the reasons you mentioned, I determined that it is best
to leave the null count field always 0 for RLE arrays. This way it is
consistent with union types, at least.

RunLengthEncoded data should not contain a null mask by itself. The
idea so far is that Null is just one of the possible values for a run.

(If we were to allow the RLE array parent to have an additional null
mask, the null count field would represent that. There seems to be a
general assumption in Arrow code that a non-zero (or array length for
the NULL type) null count means the presence of the standard null mask.)
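
As a rough illustration of the convention described above, the following plain-Python sketch (not the Arrow API) models a run-end encoded parent whose null count stays 0 while nulls appear only as run values. The class `RunEndEncoded`, its fields, and the helper `has_validity_bitmap` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RunEndEncoded:
    """Toy model of a run-end encoded array (assumed shape, not Arrow's)."""

    run_ends: List[int]            # exclusive logical end of each run
    values: List[Optional[Any]]    # one value per run; None models a null run
    null_count: int = 0            # parent null count, left at 0 by convention

    def to_list(self) -> List[Optional[Any]]:
        """Expand the runs into the logical sequence of values."""
        out: List[Optional[Any]] = []
        start = 0
        for end, value in zip(self.run_ends, self.values):
            out.extend([value] * (end - start))
            start = end
        return out


def has_validity_bitmap(array) -> bool:
    """Generic check of the kind the thread alludes to: a non-zero null
    count is taken to mean a null mask is present."""
    return array.null_count != 0


ree = RunEndEncoded(run_ends=[2, 5], values=[None, "a"])
decoded = ree.to_list()
needs_mask = has_validity_bitmap(ree)
```

Here the decoded sequence contains nulls, yet the generic check still reports that no null mask is present on the parent, which is the behaviour the always-0 convention is meant to preserve.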

Best,
Tobias 
