You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/10/24 02:39:09 UTC

[VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Hello,
As discussed on [1], I've proposed clarifications in a PR [2] that
clarifies:

1.  It is not required that all dictionary batches occur at the beginning
of the IPC stream format (if a the first record batch has an all null
dictionary encoded column, the null column's dictionary might not be sent
until later in the stream).

2.  A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.

3.  Clarifies that the file format, can only contain 1 "NON-delta"
dictionary batch and multiple "delta" dictionary batches.

4.  Add an enum to dictionary metadata for possible future changes in what
format dictionary batches can be sent. (the most likely would be an array
Map<Int, Value>).  An enum is needed as a place holder to allow for forward
compatibility past the release 1.0.0.

If accepted there will be work in all implementations to make sure that
they cover the edge cases highlighted and additional integration testing
will be needed.

Please vote whether to accept these additions. The vote will be open for at
least 72 hours.

[ ] +1 Accept these change to the specification
[ ] +0
[ ] -1 Do not accept the changes because...

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/5585

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Posted by Wes McKinney <we...@gmail.com>.
I wrote in on the original DISCUSS thread. I believe Antoine is
unavailable this week, but hopefully we can drive the discussion to a
consensus point next week so we can vote

On Sat, Oct 26, 2019 at 12:01 AM Micah Kornfield <em...@gmail.com> wrote:
>
> I think at least the wording was confusing because you raised questions on the PR and Antoine commented here.
>
> I agree with your analysis that it probably would not be hard to support.  But don't feel too strongly either way on this particular point, aside from coming to a resolution.   If I had to choose I'd prefer allowing Delta dictionaries in files.
>
> On Friday, October 25, 2019, Wes McKinney <we...@gmail.com> wrote:
>>
>> Can we discuss the delta dictionary issue a bit more? I admit I don't
>> share that same concerns.
>>
>> From the perspective of a file and stream producer, the code paths
>> should be nearly identical. The differences with the file format are:
>>
>> * Magic numbers to detect that it is the "file format"
>> * Accumulated metadata at the footer
>>
>> If a file has any dictionaries at all, then they must all be
>> reconstructed before reading a record batch. So let's say we have a
>> file like
>>
>> DICTIONARY ID=0, isDelta=FALSE
>> BATCH 0
>> BATCH 1
>> BATCH 2
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 3
>> DICTIONARY ID=0, isDelta=TRUE
>> BATCH 4
>>
>> I do not see any harm in this -- the only downside is that you won't
>> know what "state" the dictionary was in for the first 3 batches.
>> Viewing dictionary encoding strictly as a data representation method,
>> the batches 0-2 and 3 represent the same data even if their in-memory
>> dictionaries are larger than they were than the moment in which they
>> were written
>>
>> Note that the code path for "processing" the dictionaries as a first
>> step will use the same code as the stream path. It should not be a
>> great deal of work to write test cases for this
>>
>> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > Hi Antoine,
>> > There is a defined order for dictionaries in metadata.  What isn't well
>> > defined is relative ordering between record batches and Delta dictionaries.
>> >
>> >  However, this point seems confusing. I can't think of a real-world use
>> > case we're it would be valuable enough to include, so I will remove Delta
>> > dictionaries.
>> >
>> > So let's cancel this vote and I'll start a new one after the update.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thursday, October 24, 2019, Antoine Pitrou <an...@python.org> wrote:
>> >
>> > >
>> > > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
>> > > >
>> > > > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
>> > > > dictionary batch and multiple "delta" dictionary batches.
>> > >
>> > > This is a bit weird.  If the file format can carry delta dictionaries,
>> > > it means order is significant, so it may as well contain dictionary
>> > > redefinitions.
>> > >
>> > > If the file format is meant to be truly readable in random order, then
>> > > it should also forbid delta dictionaries.
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Posted by Micah Kornfield <em...@gmail.com>.
I think at least the wording was confusing because you raised questions on
the PR and Antoine commented here.

I agree with your analysis that it probably would not be hard to support.
But don't feel too strongly either way on this particular point, aside from
coming to a resolution.   If I had to choose I'd prefer allowing Delta
dictionaries in files.

On Friday, October 25, 2019, Wes McKinney <we...@gmail.com> wrote:

> Can we discuss the delta dictionary issue a bit more? I admit I don't
> share that same concerns.
>
> From the perspective of a file and stream producer, the code paths
> should be nearly identical. The differences with the file format are:
>
> * Magic numbers to detect that it is the "file format"
> * Accumulated metadata at the footer
>
> If a file has any dictionaries at all, then they must all be
> reconstructed before reading a record batch. So let's say we have a
> file like
>
> DICTIONARY ID=0, isDelta=FALSE
> BATCH 0
> BATCH 1
> BATCH 2
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 3
> DICTIONARY ID=0, isDelta=TRUE
> BATCH 4
>
> I do not see any harm in this -- the only downside is that you won't
> know what "state" the dictionary was in for the first 3 batches.
> Viewing dictionary encoding strictly as a data representation method,
> the batches 0-2 and 3 represent the same data even if their in-memory
> dictionaries are larger than they were than the moment in which they
> were written
>
> Note that the code path for "processing" the dictionaries as a first
> step will use the same code as the stream path. It should not be a
> great deal of work to write test cases for this
>
> On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Hi Antoine,
> > There is a defined order for dictionaries in metadata.  What isn't well
> > defined is relative ordering between record batches and Delta
> dictionaries.
> >
> >  However, this point seems confusing. I can't think of a real-world use
> > case we're it would be valuable enough to include, so I will remove Delta
> > dictionaries.
> >
> > So let's cancel this vote and I'll start a new one after the update.
> >
> > Thanks,
> > Micah
> >
> > On Thursday, October 24, 2019, Antoine Pitrou <an...@python.org>
> wrote:
> >
> > >
> > > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> > > >
> > > > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> > > > dictionary batch and multiple "delta" dictionary batches.
> > >
> > > This is a bit weird.  If the file format can carry delta dictionaries,
> > > it means order is significant, so it may as well contain dictionary
> > > redefinitions.
> > >
> > > If the file format is meant to be truly readable in random order, then
> > > it should also forbid delta dictionaries.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Posted by Wes McKinney <we...@gmail.com>.
Can we discuss the delta dictionary issue a bit more? I admit I don't
share that same concerns.

From the perspective of a file and stream producer, the code paths
should be nearly identical. The differences with the file format are:

* Magic numbers to detect that it is the "file format"
* Accumulated metadata at the footer

If a file has any dictionaries at all, then they must all be
reconstructed before reading a record batch. So let's say we have a
file like

DICTIONARY ID=0, isDelta=FALSE
BATCH 0
BATCH 1
BATCH 2
DICTIONARY ID=0, isDelta=TRUE
BATCH 3
DICTIONARY ID=0, isDelta=TRUE
BATCH 4

I do not see any harm in this -- the only downside is that you won't
know what "state" the dictionary was in for the first 3 batches.
Viewing dictionary encoding strictly as a data representation method,
the batches 0-2 and 3 represent the same data even if their in-memory
dictionaries are larger than they were than the moment in which they
were written

Note that the code path for "processing" the dictionaries as a first
step will use the same code as the stream path. It should not be a
great deal of work to write test cases for this

On Thu, Oct 24, 2019 at 11:06 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Antoine,
> There is a defined order for dictionaries in metadata.  What isn't well
> defined is relative ordering between record batches and Delta dictionaries.
>
>  However, this point seems confusing. I can't think of a real-world use
> case we're it would be valuable enough to include, so I will remove Delta
> dictionaries.
>
> So let's cancel this vote and I'll start a new one after the update.
>
> Thanks,
> Micah
>
> On Thursday, October 24, 2019, Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> > >
> > > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> > > dictionary batch and multiple "delta" dictionary batches.
> >
> > This is a bit weird.  If the file format can carry delta dictionaries,
> > it means order is significant, so it may as well contain dictionary
> > redefinitions.
> >
> > If the file format is meant to be truly readable in random order, then
> > it should also forbid delta dictionaries.
> >
> > Regards
> >
> > Antoine.
> >

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Posted by Micah Kornfield <em...@gmail.com>.
Hi Antoine,
There is a defined order for dictionaries in metadata.  What isn't well
defined is relative ordering between record batches and Delta dictionaries.

 However, this point seems confusing. I can't think of a real-world use
case we're it would be valuable enough to include, so I will remove Delta
dictionaries.

So let's cancel this vote and I'll start a new one after the update.

Thanks,
Micah

On Thursday, October 24, 2019, Antoine Pitrou <an...@python.org> wrote:

>
> Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> >
> > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> > dictionary batch and multiple "delta" dictionary batches.
>
> This is a bit weird.  If the file format can carry delta dictionaries,
> it means order is significant, so it may as well contain dictionary
> redefinitions.
>
> If the file format is meant to be truly readable in random order, then
> it should also forbid delta dictionaries.
>
> Regards
>
> Antoine.
>

Re: [VOTE] Clarifications and forward compatibility changes for Dictionary Encoding

Posted by Antoine Pitrou <an...@python.org>.
Le 24/10/2019 à 04:39, Micah Kornfield a écrit :
> 
> 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> dictionary batch and multiple "delta" dictionary batches.

This is a bit weird.  If the file format can carry delta dictionaries,
it means order is significant, so it may as well contain dictionary
redefinitions.

If the file format is meant to be truly readable in random order, then
it should also forbid delta dictionaries.

Regards

Antoine.