Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2020/02/09 06:53:29 UTC

[Format] Dictionary edge cases (encoding nulls and nested dictionaries)

I'd like to understand if anyone is making use of the following features
and whether we should revisit them before 1.0.

1. Dictionaries can encode null values.
- This becomes error-prone for things like Parquet.  We seem to be
calculating the definition level solely based on the null bitmap.

I might have missed something, but it appears that we only check if a
dictionary contains nulls on the optimized path [1] but not when converting
the dictionary array back to dense, so I think the values written could get
out of sync with the rep/def levels?

It seems we should potentially disallow dictionaries from containing null
values?
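
For concreteness, here is a minimal sketch of the two places a null can
live, using pyarrow for brevity (illustrative only; the same layout
questions apply to the C++ API).  A null in the dictionary values is
invisible to the array's validity bitmap, which is what the definition
level calculation reads:

import pyarrow as pa

# A null can live in the indices, where the validity bitmap sees it...
index_null = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, None], type=pa.int32()),
    pa.array(["a", "b"], type=pa.string()))

# ...or in the dictionary values, where the bitmap does not.
value_null = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 1], type=pa.int32()),
    pa.array(["a", None], type=pa.string()))

print(index_null.null_count)  # 1
print(value_null.null_count)  # 0, even though two slots decode to null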

2.  Dictionaries can encode nested columns which are in turn
dictionary-encoded columns.

- Again, we aren't handling this in Parquet today, and I'm wondering if
it's worth the effort.
There was a PR merged a while ago [2] to add a "skipped" integration test,
but it doesn't look like anyone has done follow-up work to enable
this/make it pass.

It seems simpler to keep dictionary encoding at the leaves of the schema.
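
To make the shape concrete, here is the distinction sketched with pyarrow
type constructors (illustrative; whether arrays of the nested variants can
be built and round-tripped end-to-end is exactly the untested part):

import pyarrow as pa

# Dictionary encoding at a leaf of a nested type (the simpler layout):
leaf = pa.list_(pa.dictionary(pa.int32(), pa.string()))

# A dictionary whose value type is itself nested:
nested_values = pa.dictionary(pa.int32(), pa.list_(pa.string()))

# The fully recursive case (dictionary-encoded children inside the
# dictionary values) is the one without integration coverage; an
# implementation may reject it outright:
# pa.dictionary(pa.int32(), pa.list_(pa.dictionary(pa.int32(), pa.string())))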

Of the two, I'm a little more worried that Option #1 will break people if
we decide to disallow it.

Thoughts?

Thanks,
Micah


[1]
https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
[2] https://github.com/apache/arrow/pull/1848

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Feb 18, 2020 at 2:01 AM Micah Kornfield <em...@gmail.com> wrote:
>
>
>> * evaluating an expression like SUM(ISNULL($field)) is more
>> semantically ambiguous (you have to check more things) when $field is
>> a dictionary-encoded type and the values of the dictionary could be
>> null
>
> It is this type of thing that I'm worried about (Parquet just happens to be where I'm working now).  Having corner cases that even experts in Arrow have a hard time getting right seems not great, but I don't have solutions either.
>
> Another example of this is the assumption that null lists take zero length (at least in the Arrow-to-Parquet code, I'm not seeing where we handle the case where that assumption is false).
>
>> * some other systems with dictionary encoding allow null as a
>> dictionary value [1]. Sometimes this is used to determine whether to
>> include the "null group" in tabulations / group aggregations
>
> This seems like a good reason to keep it.  Does R support the same double nullness that Arrow has in its spec?  I wonder if an update to the spec saying only one mechanism should be used for nullness would be helpful or if even that is too big of a change?

It seems that when R sees NA in the factor levels, it is treated as a
separate value and not considered null/NA.  (FWIW, NA and NULL in R are
not the same thing, but that's a different topic that we can avoid for
our purposes.)

> a <- factor(c("a", "a", "a", "b", "b"))
> a <- factor(c("a", "a", "a", "b", "b", NA))
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
> table(a)
a
a b
3 2
> a <- addNA(a)
> is.na(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> a
[1] a    a    a    b    b    <NA>
Levels: a b <NA>
> table(a)
a
   a    b <NA>
   3    2    1

If you look at the implementation of functions like "table" you can
see a good amount of code having to do with handling this case (the
factor levels / dictionary values contain a null and so the NA result
needs to be appended to the result).

In practice it seems that the semantics may be application-determined
-- a null in the dictionary values versus a null in the dictionary
indices could have different semantic meanings (or not, it's hard to
say).


Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Micah Kornfield <em...@gmail.com>.
> * evaluating an expression like SUM(ISNULL($field)) is more
> semantically ambiguous (you have to check more things) when $field is
> a dictionary-encoded type and the values of the dictionary could be
> null

It is this type of thing that I'm worried about (Parquet just happens to be
where I'm working now).  Having corner cases that even experts in Arrow
have a hard time getting right seems not great, but I don't have solutions
either.

Another example of this is the assumption that null lists take zero
length (at least in the Arrow-to-Parquet code, I'm not seeing where we
handle the case where that assumption is false).
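
A small pyarrow illustration of the convention in question (the builder
happens to give the null slot zero length, which is what the Parquet path
assumes, but as far as I can tell the format itself does not require the
offsets to do that):

import pyarrow as pa

arr = pa.array([["a", "b"], None, ["c"]])
# The builder emits equal offsets around the null slot: 0, 2, 2, 3.
print(arr.offsets)
print(arr.null_count)  # 1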

> * some other systems with dictionary encoding allow null as a
> dictionary value [1]. Sometimes this is used to determine whether to
> include the "null group" in tabulations / group aggregations

This seems like a good reason to keep it.  Does R support the same double
nullness that Arrow has in its spec?  I wonder if an update to the spec
saying only one mechanism should be used for nullness would be helpful or
if even that is too big of a change?

Thanks,
Micah

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Wes McKinney <we...@gmail.com>.
hi Micah,

It seems like the null and nested issues really come up when trying to
translate from one dictionary encoding scheme to another. That we are
able to directly write dictionary-encoded data to Parquet format is
beneficial, but it doesn't seem like we should let the constraints of
Parquet's dictionary compression push back into the Arrow
specification. I agree that it does add complexity to implementing the
specification.

Some other thoughts about the null issue:

* evaluating an expression like SUM(ISNULL($field)) is more
semantically ambiguous (you have to check more things) when $field is
a dictionary-encoded type and the values of the dictionary could be
null (see the sketch after this list)
* there may be situations where you want to consider null to be a value
* some other systems with dictionary encoding allow null as a
dictionary value [1]. Sometimes this is used to determine whether to
include the "null group" in tabulations / group aggregations
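
To spell out the first point, a pyarrow sketch (illustrative only): an
engine that answers ISNULL from the validity bitmap alone disagrees with
one that decodes the dictionary first:

import pyarrow as pa

field = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 1], type=pa.int32()),     # no nulls in the indices
    pa.array(["a", None], type=pa.string()))  # a null in the dictionary

print(field.null_count)                    # 0 per the validity bitmap
print(field.cast(pa.string()).null_count)  # 2 once the dictionary is decoded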

Others might have more opinions

[1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html


Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes and Brian,

Thanks for the feedback.  My reason for raising these issues is that they
make the spec harder to work with/implement (i.e., we have existing bugs,
etc.).  I'm wondering if we should take the opportunity to simplify before
things are set in stone. If we think things are already set, then I'm OK
with that as well.

Thanks,
Micah


Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Wes McKinney <we...@gmail.com>.
On Sun, Feb 9, 2020 at 12:53 AM Micah Kornfield <em...@gmail.com>
wrote:
>
> I'd like to understand if anyone is making use of the following features
> and whether we should revisit them before 1.0.
>
> 1. Dictionaries can encode null values.
> - This becomes error-prone for things like Parquet.  We seem to be
> calculating the definition level solely based on the null bitmap.
>
> I might have missed something, but it appears that we only check if a
> dictionary contains nulls on the optimized path [1] but not when
> converting the dictionary array back to dense, so I think the values
> written could get out of sync with the rep/def levels?
>
> It seems we should potentially disallow dictionaries from containing
> null values?

Are you talking about the direct DictionaryArray encoding path?

Since both nulls and duplicates are allowed in Arrow dictionaries, I think
that Parquet will need to check for the null-in-dictionary case in the
equivalent of ArrowColumnWriter::Write [1]. You could start by checking and
erroring out.

[1]:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L324
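
Something as small as this would do for a first cut (a pyarrow-flavored
sketch; the real check belongs in the C++ writer, and
write_dictionary_column is a hypothetical stand-in for that entry point):

import pyarrow as pa

def write_dictionary_column(arr: pa.DictionaryArray) -> None:
    # Hypothetical stand-in for the equivalent of ArrowColumnWriter::Write.
    if arr.dictionary.null_count > 0:
        raise NotImplementedError(
            "writing dictionaries that contain nulls is not supported yet")
    # ... proceed with the normal dictionary write path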

>
> 2.  Dictionaries can encode nested columns which are in turn
> dictionary-encoded columns.
>
> - Again, we aren't handling this in Parquet today, and I'm wondering if
> it's worth the effort.
> There was a PR merged a while ago [2] to add a "skipped" integration test,
> but it doesn't look like anyone has done follow-up work to enable
> this/make it pass.

I started looking at this the other day (on the integration tests); I'd
like to get the C++ side fully working with integration tests.

As far as Parquet goes, it doesn't seem worth the effort to try to
implement a faithful roundtrip. Since we are serializing the Arrow types
into the file metadata, we could at least cast back but perhaps lose the
ordering.
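
i.e., on the read path, something along these lines (pyarrow sketch;
restore_dictionary is a hypothetical helper, and the rebuilt dictionary
comes out in order of first appearance rather than the original order):

import pyarrow as pa

def restore_dictionary(dense: pa.Array) -> pa.Array:
    # Re-encode the dense column read back from Parquet.  The dictionary
    # is rebuilt from the data, so the original ordering is not preserved.
    return dense.dictionary_encode()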

>
> It seems simpler to keep dictionary encoding at the leaves of the schema.
>
> Of the two, I'm a little more worried that Option #1 will break people
> if we decide to disallow it.

I think the ship has sailed on disallowing this at the format level.

>
> Thoughts?
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/bd38beec033a2fdff192273df9b08f120e635b0c/cpp/src/parquet/encoding.cc#L765
> [2] https://github.com/apache/arrow/pull/1848

Re: [Format] Dictionary edge cases (encoding nulls and nested dictionaries)

Posted by Brian Hulette <hu...@gmail.com>.
> It seems we should potentially disallow dictionaries from containing
> null values?
+1 - I've always thought it was odd that you could encode null values in
two different places for dictionary-encoded columns.
You could argue it's more efficient to encode the nulls in the dictionary,
but I think if we're going to allow that we should go further: we know
there should only be _one_ index with the NULL value in a dictionary, so
why encode an entire validity buffer? Maybe this is one place where a
sentinel value makes sense.
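
A sketch of what I mean (hypothetical; this is not an existing Arrow
layout): reserve one index value as the null sentinel and drop the
validity buffer entirely:

import pyarrow as pa

NULL_INDEX = -1  # hypothetical sentinel, not part of the format today

def decode(indices, dictionary):
    # Decode sentinel-encoded indices without consulting a validity buffer.
    return [None if i == NULL_INDEX else dictionary[i].as_py()
            for i in indices]

print(decode([0, NULL_INDEX, 1], pa.array(["a", "b"])))  # ['a', None, 'b']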


The mailing list thread where I brought up the idea of nested dictionaries
[1] is useful context for item 2. I still think this is a good idea, but
I've changed jobs since then and the use-case I described is no longer
motivating me to actually implement it.

> It seems simpler to keep dictionary encoding at the leaves of the schema.
Do we need to go that far? I think we could still allow dictionary encoding
at any level of a hierarchy, and just disallow nested dictionaries.

Brian

[1]
https://lists.apache.org/thread.html/37c0480c4c7a48dd298e8459938444afb901bf01dcebd5f8c5f1dee6%40%3Cdev.arrow.apache.org%3E
