You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/11/06 17:23:42 UTC

Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Just bumping this thread for more comments

On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney <we...@gmail.com> wrote:
>
> Returning to this discussion as there seems to lack consensus in the vote thread
>
> Copying Micah's proposals in the VOTE thread here, I wanted to state
> my opinions so we can discuss further and see where there is potential
> disagreement
>
> 1.  It is not required that all dictionary batches occur at the beginning
> of the IPC stream format (if a the first record batch has an all null
> dictionary encoded column, the null column's dictionary might not be sent
> until later in the stream).
>
> This seems preferable to requiring a placeholder empty dictionary
> batch. This does mean more to test but the integration tests will
> force the issue
>
> 2.  A second dictionary batch for the same ID that is not a "delta batch"
> in an IPC stream indicates the dictionary should be replaced.
>
> Agree.
>
> 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> dictionary batch and multiple "delta" dictionary batches.
>
> Agree -- it is also worth stating explicitly that dictionary
> replacements are not allowed in the file format.
>
> In the file format, all the dictionaries must be "loaded" up front.
> The code path for loading the dictionaries ideally should use nearly
> the same code as the stream-reader code that sees follow-up dictionary
> batches interspersed in the stream. The only downside is that it will
> not be possible to exactly preserve the dictionary "state" as of each
> record batch being written.
>
> So if we had a file containing
>
> DICTIONARY ID=0
> RECORD BATCH
> RECORD BATCH
> DICTIONARY DELTA ID=0
> RECORD BATCH
> RECORD BATCH
>
> Then after processing/loading the dictionaries, the first two record
> batches will have a dictionary that is "larger" (on account of the
> delta) than when they were written. Since dictionaries are
> fundamentally about data representation, they still represent the same
> data so I think this is acceptable.
>
> 4.  Add an enum to dictionary metadata for possible future changes in what
> format dictionary batches can be sent. (the most likely would be an array
> Map<Int, Value>).  An enum is needed as a place holder to allow for forward
> compatibility past the release 1.0.0.
>
> I'm least sure about this but I do not think it is harmful to have a
> forward-compatible "escape hatch" for future evolutions in dictionary
> encoding.
>
> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > I'll plan on starting a vote in the next day or two if there are no further
> > objections/comments.
> >
> > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > I think the only point asked on the PR that I think is worth discussing is
> > > assumptions about dictionaries at the beginning of streams.
> > >
> > > There are two options:
> > > 1.  Based on the current wording, it does not seem that all dictionaries
> > > need to be at the beginning of the stream if they aren't made use of in the
> > > first record batch (i.e. a dictionary encoded column is all null in the
> > > first record batch).
> > > 2.  We require a dictionary batch for each dictionary at the beginning of
> > > the stream (and require implementations to send an empty batch if they
> > > don't have the dictionary available).
> > >
> > > The current proposal in the PR is option #1.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > >
> > >> I've opened a pull request [1] to clarify some recent conversations about
> > >> semantics/edge cases for dictionary encoding [2][3] around interleaved
> > >> batches and when isDelta=False.
> > >>
> > >> Specifically, it proposes isDelta=False indicates dictionary
> > >> replacement.  For the file format, only one isDelta=False batch is allowed
> > >> per file and isDelta=true batches are applied in the order supplied file
> > >> footer.
> > >>
> > >> In addition, I've added a new enum to DictionaryEncoding to preserve
> > >> future compatibility in case we want to expand dictionary encoding to be an
> > >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> > >>
> > >> Once people have had a change to review and come to a consensus. I will
> > >> call a formal vote to approve the change commit the change.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://github.com/apache/arrow/pull/5585
> > >> [2]
> > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> > >> [3]
> > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> > >> [4]
> > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> > >>
> > >>

Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Posted by Micah Kornfield <em...@gmail.com>.
I think the main sticking point was dictionaries in the file format.  It
seems like the use-case for delta dictionaries might be limited so I didn't
feel strongly about it.

Antoine, did you have more thoughts on this?

Thanks,
Micah


On Wed, Nov 6, 2019 at 9:24 AM Wes McKinney <we...@gmail.com> wrote:

> Just bumping this thread for more comments
>
> On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > Returning to this discussion as there seems to lack consensus in the
> vote thread
> >
> > Copying Micah's proposals in the VOTE thread here, I wanted to state
> > my opinions so we can discuss further and see where there is potential
> > disagreement
> >
> > 1.  It is not required that all dictionary batches occur at the beginning
> > of the IPC stream format (if a the first record batch has an all null
> > dictionary encoded column, the null column's dictionary might not be sent
> > until later in the stream).
> >
> > This seems preferable to requiring a placeholder empty dictionary
> > batch. This does mean more to test but the integration tests will
> > force the issue
> >
> > 2.  A second dictionary batch for the same ID that is not a "delta batch"
> > in an IPC stream indicates the dictionary should be replaced.
> >
> > Agree.
> >
> > 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> > dictionary batch and multiple "delta" dictionary batches.
> >
> > Agree -- it is also worth stating explicitly that dictionary
> > replacements are not allowed in the file format.
> >
> > In the file format, all the dictionaries must be "loaded" up front.
> > The code path for loading the dictionaries ideally should use nearly
> > the same code as the stream-reader code that sees follow-up dictionary
> > batches interspersed in the stream. The only downside is that it will
> > not be possible to exactly preserve the dictionary "state" as of each
> > record batch being written.
> >
> > So if we had a file containing
> >
> > DICTIONARY ID=0
> > RECORD BATCH
> > RECORD BATCH
> > DICTIONARY DELTA ID=0
> > RECORD BATCH
> > RECORD BATCH
> >
> > Then after processing/loading the dictionaries, the first two record
> > batches will have a dictionary that is "larger" (on account of the
> > delta) than when they were written. Since dictionaries are
> > fundamentally about data representation, they still represent the same
> > data so I think this is acceptable.
> >
> > 4.  Add an enum to dictionary metadata for possible future changes in
> what
> > format dictionary batches can be sent. (the most likely would be an array
> > Map<Int, Value>).  An enum is needed as a place holder to allow for
> forward
> > compatibility past the release 1.0.0.
> >
> > I'm least sure about this but I do not think it is harmful to have a
> > forward-compatible "escape hatch" for future evolutions in dictionary
> > encoding.
> >
> > On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > I'll plan on starting a vote in the next day or two if there are no
> further
> > > objections/comments.
> > >
> > > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <
> emkornfield@gmail.com>
> > > wrote:
> > >
> > > > I think the only point asked on the PR that I think is worth
> discussing is
> > > > assumptions about dictionaries at the beginning of streams.
> > > >
> > > > There are two options:
> > > > 1.  Based on the current wording, it does not seem that all
> dictionaries
> > > > need to be at the beginning of the stream if they aren't made use of
> in the
> > > > first record batch (i.e. a dictionary encoded column is all null in
> the
> > > > first record batch).
> > > > 2.  We require a dictionary batch for each dictionary at the
> beginning of
> > > > the stream (and require implementations to send an empty batch if
> they
> > > > don't have the dictionary available).
> > > >
> > > > The current proposal in the PR is option #1.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <
> emkornfield@gmail.com>
> > > > wrote:
> > > >
> > > >> I've opened a pull request [1] to clarify some recent conversations
> about
> > > >> semantics/edge cases for dictionary encoding [2][3] around
> interleaved
> > > >> batches and when isDelta=False.
> > > >>
> > > >> Specifically, it proposes isDelta=False indicates dictionary
> > > >> replacement.  For the file format, only one isDelta=False batch is
> allowed
> > > >> per file and isDelta=true batches are applied in the order supplied
> file
> > > >> footer.
> > > >>
> > > >> In addition, I've added a new enum to DictionaryEncoding to preserve
> > > >> future compatibility in case we want to expand dictionary encoding
> to be an
> > > >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> > > >>
> > > >> Once people have had a change to review and come to a consensus. I
> will
> > > >> call a formal vote to approve the change commit the change.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1] https://github.com/apache/arrow/pull/5585
> > > >> [2]
> > > >>
> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> > > >> [3]
> > > >>
> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> > > >> [4]
> > > >>
> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> > > >>
> > > >>
>