Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/08/10 07:12:54 UTC

[Format] Semantics for dictionary batches in streams

The IPC specification [1] defines behavior when isDelta on a
DictionaryBatch [2] is "true".  I might have missed it in the
specification, but I couldn't find the expected behavior when
isDelta=false and two dictionary batches with the same ID are sent.

It seems like there are two options:
1.  Interpret the new dictionary batch as replacing the old one.
2.  Regard this as an error condition.

Based on the fact that in the "file format" dictionaries are allowed to be
placed in any order relative to the record batches, I assume it is the
second, but just wanted to make sure.

Thanks,
Micah

[1] https://arrow.apache.org/docs/ipc.html
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
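For illustration, the two interpretations could be modeled in a toy stream reader (plain Python; the class and method names are hypothetical, not the pyarrow API):

```python
class DictionaryMemo:
    """Toy model of per-stream dictionary state, keyed by dictionary id."""

    def __init__(self, allow_replacement):
        self.allow_replacement = allow_replacement
        self.dictionaries = {}  # dictionary id -> list of values

    def on_dictionary_batch(self, dict_id, values, is_delta):
        if is_delta:
            # isDelta=true: append the new entries to the existing dictionary.
            self.dictionaries.setdefault(dict_id, []).extend(values)
        elif dict_id in self.dictionaries and not self.allow_replacement:
            # Option 2: a second non-delta batch with the same id is an error.
            raise ValueError(f"duplicate dictionary id {dict_id}")
        else:
            # Option 1: the new batch replaces the old dictionary wholesale.
            self.dictionaries[dict_id] = list(values)

# Option 1 semantics: a repeated non-delta batch replaces the dictionary.
memo = DictionaryMemo(allow_replacement=True)
memo.on_dictionary_batch(0, ["A", "B", "C"], is_delta=False)
memo.on_dictionary_batch(0, ["X", "Y"], is_delta=False)
assert memo.dictionaries[0] == ["X", "Y"]
```

Under option 2, the second `on_dictionary_batch` call above would raise instead of replacing.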

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
Yes, I opened a JIRA; I'm going to try to make a proposal that consolidates
all the recent dictionary discussions.

On Mon, Sep 9, 2019 at 12:21 PM Wes McKinney <we...@gmail.com> wrote:

> hi Micah,
>
> I think we should formulate changes to format/Columnar.rst and have a
> vote, what do you think?
>
> On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >>
> >>
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one dictionarybatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify?
> >
> > I think you clarified it very well :) My motivation for suggesting the
> additional complexity is I see two use-cases for the file format.  These
> roughly correspond with the two options I suggested:
> > 1.  We are encoding data from scratch.  In this case, it seems like all
> dictionaries would be built incrementally, not need replacement and we
> write them at the end of the file [1]
> >
> > 2.  The data being written out is essentially a "tee" off of some stream
> that is generating new dictionaries requiring replacement on the fly (i.e.
> reading back two parquet files).
> >
> >>  It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >
> > It is is certainly possible, to accept the slippage from the stream
> format for now and later add this capability, since it should be forwards
> compatible.
> >
> > Thanks,
> > Micah
> >
> > [1] There is also medium complexity option where we require one
> non-delta dictionary and as many delta dictionaries as the user want.
> >
> > On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >> >
> >> > I was thinking the file format must satisfy one of two conditions:
> >> > 1.  Exactly one dictionarybatch per encoded column
> >> > 2.  DictionaryBatches are interleaved correctly.
> >>
> >> Could you clarify? In the first case, there is no issue with
> >> dictionary replacements. I'm not sure about the second case -- if a
> >> dictionary id appears twice, then you'll see it twice in the file
> >> footer. I suppose you could look at the file offsets to determine
> >> whether a dictionary batch precedes a particular record batch block
> >> (to know which dictionary you should be using), but that's rather
> >> complicated to implement. It might be better to disallow replacements
> >> in the file format (which does introduce semantic slippage between the
> >> file and stream formats as Antoine was saying).
> >>
> >> >
> >> > On Tuesday, August 27, 2019, Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org>
> wrote:
> >> > > >
> >> > > >
> >> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> >> > > > > So the current situation we have right now in C++ is that if we
> tried
> >> > > > > to create an IPC stream from a sequence of record batches that
> don't
> >> > > > > all have the same dictionary, we'd run into two scenarios:
> >> > > > >
> >> > > > > * Batches that either have a prefix of a prior-observed
> dictionary, or
> >> > > > > the prior dictionary is a prefix of their dictionary. For
> example,
> >> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C']
> and
> >> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E'].
> In
> >> > > > > such case we could compute and send a delta batch
> >> > > > >
> >> > > > > * Batches with a dictionary that is a permutation of values, and
> >> > > > > possibly new unique values.
> >> > > > >
> >> > > > > In this latter case, without the option of replacing an
> existing ID in
> >> > > > > the stream, we would have to do a unification / permutation of
> indices
> >> > > > > and then also possibly send a delta batch. We should probably
> have
> >> > > > > code at some point that deals with both cases, but in the
> meantime I
> >> > > > > would like to allow dictionaries to be redefined in this case.
> Seems
> >> > > > > like we might need a vote to formalize this?
> >> > > >
> >> > > > Isn't the stream format deviating from the file format then?  In
> the
> >> > > > file format, IIUC, dictionaries can appear after the respective
> record
> >> > > > batches, so there's no way to tell whether the original or
> redefined
> >> > > > version of a dictionary is being referred to.
> >> > >
> >> > > You make a good point -- we can consider changes to the file format
> to
> >> > > allow for record batches to have different dictionaries. Even
> handling
> >> > > delta dictionaries with the current file format would be a bit
> tedious
> >> > > (though not indeterminate)
> >> > >
> >> > > > Regards
> >> > > >
> >> > > > Antoine.
> >> > >
>

Re: [Format] Semantics for dictionary batches in streams

Posted by Wes McKinney <we...@gmail.com>.
hi Micah,

I think we should formulate changes to format/Columnar.rst and have a
vote, what do you think?

On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <em...@gmail.com> wrote:
>>
>>
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one dictionarybatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify?
>
> I think you clarified it very well :) My motivation for suggesting the additional complexity is I see two use-cases for the file format.  These roughly correspond with the two options I suggested:
> 1.  We are encoding data from scratch.  In this case, it seems like all dictionaries would be built incrementally, not need replacement and we write them at the end of the file [1]
>
> 2.  The data being written out is essentially a "tee" off of some stream that is generating new dictionaries requiring replacement on the fly (i.e. reading back two parquet files).
>
>>  It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>
> It is is certainly possible, to accept the slippage from the stream format for now and later add this capability, since it should be forwards compatible.
>
> Thanks,
> Micah
>
> [1] There is also medium complexity option where we require one non-delta dictionary and as many delta dictionaries as the user want.
>
> On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <we...@gmail.com> wrote:
>>
>> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one dictionarybatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify? In the first case, there is no issue with
>> dictionary replacements. I'm not sure about the second case -- if a
>> dictionary id appears twice, then you'll see it twice in the file
>> footer. I suppose you could look at the file offsets to determine
>> whether a dictionary batch precedes a particular record batch block
>> (to know which dictionary you should be using), but that's rather
>> complicated to implement. It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>>
>> >
>> > On Tuesday, August 27, 2019, Wes McKinney <we...@gmail.com> wrote:
>> >
>> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org> wrote:
>> > > >
>> > > >
>> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
>> > > > > So the current situation we have right now in C++ is that if we tried
>> > > > > to create an IPC stream from a sequence of record batches that don't
>> > > > > all have the same dictionary, we'd run into two scenarios:
>> > > > >
>> > > > > * Batches that either have a prefix of a prior-observed dictionary, or
>> > > > > the prior dictionary is a prefix of their dictionary. For example,
>> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
>> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
>> > > > > such case we could compute and send a delta batch
>> > > > >
>> > > > > * Batches with a dictionary that is a permutation of values, and
>> > > > > possibly new unique values.
>> > > > >
>> > > > > In this latter case, without the option of replacing an existing ID in
>> > > > > the stream, we would have to do a unification / permutation of indices
>> > > > > and then also possibly send a delta batch. We should probably have
>> > > > > code at some point that deals with both cases, but in the meantime I
>> > > > > would like to allow dictionaries to be redefined in this case. Seems
>> > > > > like we might need a vote to formalize this?
>> > > >
>> > > > Isn't the stream format deviating from the file format then?  In the
>> > > > file format, IIUC, dictionaries can appear after the respective record
>> > > > batches, so there's no way to tell whether the original or redefined
>> > > > version of a dictionary is being referred to.
>> > >
>> > > You make a good point -- we can consider changes to the file format to
>> > > allow for record batches to have different dictionaries. Even handling
>> > > delta dictionaries with the current file format would be a bit tedious
>> > > (though not indeterminate)
>> > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > >

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
>
>
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.

Could you clarify?

I think you clarified it very well :) My motivation for suggesting the
additional complexity is I see two use-cases for the file format.  These
roughly correspond with the two options I suggested:
1.  We are encoding data from scratch.  In this case, it seems like all
dictionaries would be built incrementally, would not need replacement, and we
would write them at the end of the file [1]

2.  The data being written out is essentially a "tee" off of some stream
that is generating new dictionaries requiring replacement on the fly (e.g.
reading back two Parquet files).

 It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).

It is certainly possible to accept the slippage from the stream format
for now and add this capability later, since it should be forward
compatible.

Thanks,
Micah

[1] There is also a medium-complexity option where we require one non-delta
dictionary and as many delta dictionaries as the user wants.

On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <we...@gmail.com> wrote:

> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > I was thinking the file format must satisfy one of two conditions:
> > 1.  Exactly one dictionarybatch per encoded column
> > 2.  DictionaryBatches are interleaved correctly.
>
> Could you clarify? In the first case, there is no issue with
> dictionary replacements. I'm not sure about the second case -- if a
> dictionary id appears twice, then you'll see it twice in the file
> footer. I suppose you could look at the file offsets to determine
> whether a dictionary batch precedes a particular record batch block
> (to know which dictionary you should be using), but that's rather
> complicated to implement. It might be better to disallow replacements
> in the file format (which does introduce semantic slippage between the
> file and stream formats as Antoine was saying).
>
> >
> > On Tuesday, August 27, 2019, Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org>
> wrote:
> > > >
> > > >
> > > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > > > So the current situation we have right now in C++ is that if we
> tried
> > > > > to create an IPC stream from a sequence of record batches that
> don't
> > > > > all have the same dictionary, we'd run into two scenarios:
> > > > >
> > > > > * Batches that either have a prefix of a prior-observed
> dictionary, or
> > > > > the prior dictionary is a prefix of their dictionary. For example,
> > > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > > > such case we could compute and send a delta batch
> > > > >
> > > > > * Batches with a dictionary that is a permutation of values, and
> > > > > possibly new unique values.
> > > > >
> > > > > In this latter case, without the option of replacing an existing
> ID in
> > > > > the stream, we would have to do a unification / permutation of
> indices
> > > > > and then also possibly send a delta batch. We should probably have
> > > > > code at some point that deals with both cases, but in the meantime
> I
> > > > > would like to allow dictionaries to be redefined in this case.
> Seems
> > > > > like we might need a vote to formalize this?
> > > >
> > > > Isn't the stream format deviating from the file format then?  In the
> > > > file format, IIUC, dictionaries can appear after the respective
> record
> > > > batches, so there's no way to tell whether the original or redefined
> > > > version of a dictionary is being referred to.
> > >
> > > You make a good point -- we can consider changes to the file format to
> > > allow for record batches to have different dictionaries. Even handling
> > > delta dictionaries with the current file format would be a bit tedious
> > > (though not indeterminate)
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
>

Re: [Format] Semantics for dictionary batches in streams

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I was thinking the file format must satisfy one of two conditions:
> 1.  Exactly one dictionarybatch per encoded column
> 2.  DictionaryBatches are interleaved correctly.

Could you clarify? In the first case, there is no issue with
dictionary replacements. I'm not sure about the second case -- if a
dictionary id appears twice, then you'll see it twice in the file
footer. I suppose you could look at the file offsets to determine
whether a dictionary batch precedes a particular record batch block
(to know which dictionary you should be using), but that's rather
complicated to implement. It might be better to disallow replacements
in the file format (which does introduce semantic slippage between the
file and stream formats as Antoine was saying).
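The offset-based resolution described above could be sketched as follows (a toy model with hypothetical footer tuples, not the actual Footer flatbuffer layout):

```python
# Toy footer blocks: (kind, dictionary_id_or_None, file_offset).
# Dictionary 0 appears twice; the later entry is a replacement.
blocks = [
    ("dictionary", 0, 100),
    ("record_batch", None, 200),
    ("dictionary", 0, 300),
    ("record_batch", None, 400),
]

def dictionary_offset_for(batch_offset, dict_id):
    """Pick the last dictionary batch for `dict_id` whose file offset
    precedes the record batch at `batch_offset`."""
    candidates = [off for kind, did, off in blocks
                  if kind == "dictionary" and did == dict_id
                  and off < batch_offset]
    if not candidates:
        raise LookupError(f"no dictionary {dict_id} before offset {batch_offset}")
    return max(candidates)

# Each record batch resolves to the dictionary version that precedes it.
assert dictionary_offset_for(200, 0) == 100
assert dictionary_offset_for(400, 0) == 300
```

This is what makes replacement in the file format implementable but, as noted, rather tedious: every record batch needs its own offset comparison against the dictionary blocks.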

>
> On Tuesday, August 27, 2019, Wes McKinney <we...@gmail.com> wrote:
>
> > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > > So the current situation we have right now in C++ is that if we tried
> > > > to create an IPC stream from a sequence of record batches that don't
> > > > all have the same dictionary, we'd run into two scenarios:
> > > >
> > > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > > the prior dictionary is a prefix of their dictionary. For example,
> > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > > such case we could compute and send a delta batch
> > > >
> > > > * Batches with a dictionary that is a permutation of values, and
> > > > possibly new unique values.
> > > >
> > > > In this latter case, without the option of replacing an existing ID in
> > > > the stream, we would have to do a unification / permutation of indices
> > > > and then also possibly send a delta batch. We should probably have
> > > > code at some point that deals with both cases, but in the meantime I
> > > > would like to allow dictionaries to be redefined in this case. Seems
> > > > like we might need a vote to formalize this?
> > >
> > > Isn't the stream format deviating from the file format then?  In the
> > > file format, IIUC, dictionaries can appear after the respective record
> > > batches, so there's no way to tell whether the original or redefined
> > > version of a dictionary is being referred to.
> >
> > You make a good point -- we can consider changes to the file format to
> > allow for record batches to have different dictionaries. Even handling
> > delta dictionaries with the current file format would be a bit tedious
> > (though not indeterminate)
> >
> > > Regards
> > >
> > > Antoine.
> >

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
I was thinking the file format must satisfy one of two conditions:
1.  Exactly one DictionaryBatch per encoded column
2.  DictionaryBatches are interleaved correctly.

On Tuesday, August 27, 2019, Wes McKinney <we...@gmail.com> wrote:

> On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > > So the current situation we have right now in C++ is that if we tried
> > > to create an IPC stream from a sequence of record batches that don't
> > > all have the same dictionary, we'd run into two scenarios:
> > >
> > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > the prior dictionary is a prefix of their dictionary. For example,
> > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > such case we could compute and send a delta batch
> > >
> > > * Batches with a dictionary that is a permutation of values, and
> > > possibly new unique values.
> > >
> > > In this latter case, without the option of replacing an existing ID in
> > > the stream, we would have to do a unification / permutation of indices
> > > and then also possibly send a delta batch. We should probably have
> > > code at some point that deals with both cases, but in the meantime I
> > > would like to allow dictionaries to be redefined in this case. Seems
> > > like we might need a vote to formalize this?
> >
> > Isn't the stream format deviating from the file format then?  In the
> > file format, IIUC, dictionaries can appear after the respective record
> > batches, so there's no way to tell whether the original or redefined
> > version of a dictionary is being referred to.
>
> You make a good point -- we can consider changes to the file format to
> allow for record batches to have different dictionaries. Even handling
> delta dictionaries with the current file format would be a bit tedious
> (though not indeterminate)
>
> > Regards
> >
> > Antoine.
>

Re: [Format] Semantics for dictionary batches in streams

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 27/08/2019 à 22:31, Wes McKinney a écrit :
> > So the current situation we have right now in C++ is that if we tried
> > to create an IPC stream from a sequence of record batches that don't
> > all have the same dictionary, we'd run into two scenarios:
> >
> > * Batches that either have a prefix of a prior-observed dictionary, or
> > the prior dictionary is a prefix of their dictionary. For example,
> > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > such case we could compute and send a delta batch
> >
> > * Batches with a dictionary that is a permutation of values, and
> > possibly new unique values.
> >
> > In this latter case, without the option of replacing an existing ID in
> > the stream, we would have to do a unification / permutation of indices
> > and then also possibly send a delta batch. We should probably have
> > code at some point that deals with both cases, but in the meantime I
> > would like to allow dictionaries to be redefined in this case. Seems
> > like we might need a vote to formalize this?
>
> Isn't the stream format deviating from the file format then?  In the
> file format, IIUC, dictionaries can appear after the respective record
> batches, so there's no way to tell whether the original or redefined
> version of a dictionary is being referred to.

You make a good point -- we can consider changes to the file format to
allow for record batches to have different dictionaries. Even handling
delta dictionaries with the current file format would be a bit tedious
(though not indeterminate)

> Regards
>
> Antoine.

Re: [Format] Semantics for dictionary batches in streams

Posted by Antoine Pitrou <an...@python.org>.
On 27/08/2019 at 22:31, Wes McKinney wrote:
> So the current situation we have right now in C++ is that if we tried
> to create an IPC stream from a sequence of record batches that don't
> all have the same dictionary, we'd run into two scenarios:
> 
> * Batches that either have a prefix of a prior-observed dictionary, or
> the prior dictionary is a prefix of their dictionary. For example,
> suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> such case we could compute and send a delta batch
> 
> * Batches with a dictionary that is a permutation of values, and
> possibly new unique values.
> 
> In this latter case, without the option of replacing an existing ID in
> the stream, we would have to do a unification / permutation of indices
> and then also possibly send a delta batch. We should probably have
> code at some point that deals with both cases, but in the meantime I
> would like to allow dictionaries to be redefined in this case. Seems
> like we might need a vote to formalize this?

Isn't the stream format deviating from the file format then?  In the
file format, IIUC, dictionaries can appear after the respective record
batches, so there's no way to tell whether the original or redefined
version of a dictionary is being referred to.

Regards

Antoine.

Re: [Format] Semantics for dictionary batches in streams

Posted by Wes McKinney <we...@gmail.com>.
The current situation in C++ is that if we tried
to create an IPC stream from a sequence of record batches that don't
all have the same dictionary, we'd run into two scenarios:

* Batches that either have a prefix of a prior-observed dictionary, or
the prior dictionary is a prefix of their dictionary. For example,
suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
such a case, we could compute and send a delta batch

* Batches with a dictionary that is a permutation of values, and
possibly new unique values.

In this latter case, without the option of replacing an existing ID in
the stream, we would have to do a unification / permutation of indices
and then also possibly send a delta batch. We should probably have
code at some point that deals with both cases, but in the meantime I
would like to allow dictionaries to be redefined in this case. Seems
like we might need a vote to formalize this?
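The two scenarios can be sketched with a small helper (plain Python; the function name is hypothetical): when the prior dictionary is a prefix of the new one, a delta batch carrying only the suffix suffices; otherwise the existing indices are invalidated and the dictionary must be replaced (or unified and the indices permuted).

```python
def plan_dictionary_update(old, new):
    """Decide how to transmit `new` given that `old` was already sent.
    Returns ("delta", suffix) when old is a prefix of new; otherwise
    ("replace", new), since indices into old no longer line up."""
    if new[:len(old)] == old:
        return ("delta", new[len(old):])
    return ("replace", new)

# Prefix case: send only the new values as a delta batch.
assert plan_dictionary_update(
    ["A", "B", "C"], ["A", "B", "C", "D", "E"]) == ("delta", ["D", "E"])

# Permutation case: replacement (or unification) is needed.
assert plan_dictionary_update(
    ["A", "B", "C"], ["C", "A", "B", "F"]) == ("replace", ["C", "A", "B", "F"])
```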

Independent from this decision, I would strongly recommend that all
implementations handle dictionaries in-memory as data and not metadata
(i.e. do not have dictionaries in the schema). It was lucky (see
ARROW-3144) that this problematic early design in the C++ library
could be fixed with less than a week of work.

Thanks
Wes

On Sun, Aug 11, 2019 at 9:17 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I'm not sure what you mean by record-in-dictionary-id, so it is possible
> this is a solution that I just don't understand :)
>
> The only two references to dictionary IDs that I could find, are  one in
> schema.fbs [1] which is attached a column in a schema and the one
> referenced above in DictionaryBatches define Message.fbs [2] for
> transmitting dictionaries.  It is quite possible I missed something.
>
>  The indices into the dictionary are Int Arrays in a normal record batch.
> I suppose the other option is to reset the stream by sending a new schema,
> but I don't think that is supported either. This is what lead to my
> original question.
>
> Does no one do this today?
>
> I think Wes did some recent work on the C++ Parquet in reading
> dictionaries, and might have faced some of these issues, I'm not sure how
> he dealt with it (haven't gotten back to the Parquet code yet).
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>
> On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> > Wow, you've shown how little I've thought about Arrow dictionaries for a
> > while. I thought we had a dictionary id and a record-in-dictionary-id.
> > Wouldn't that approach make more sense? Does no one do this today? (We
> > frequently use compound values for this type of scenario...)
> >
> > On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Reading data from two different parquet files sequentially with different
> >> dictionaries for the same column.  This could be handled by re-encoding
> >> data but that seems potentially sub-optimal.
> >>
> >> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>
> >>> What situation are anticipating where you're going to be restating ids
> >>> mid stream?
> >>>
> >>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <em...@gmail.com>
> >>> wrote:
> >>>
> >>>> The IPC specification [1] defines behavior when isDelta on a
> >>>> DictionaryBatch [2] is "true".  I might have missed it in the
> >>>> specification, but I couldn't find the interpretation for what the
> >>>> expected
> >>>> behavior is when isDelta=false and and two  dictionary batches  with the
> >>>> same ID are sent.
> >>>>
> >>>> It seems like there are two options:
> >>>> 1.  Interpret the new dictionary batch as replacing the old one.
> >>>> 2.  Regard this as an error condition.
> >>>>
> >>>> Based on the fact that in the "file format" dictionaries are allowed to
> >>>> be
> >>>> placed in any order relative to the record batches, I assume it is the
> >>>> second, but just wanted to make sure.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://arrow.apache.org/docs/ipc.html
> >>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
> >>>>
> >>>

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
I'm not sure what you mean by record-in-dictionary-id, so it is possible
this is a solution that I just don't understand :)

The only two references to dictionary IDs that I could find are: one in
Schema.fbs [1], which is attached to a column in a schema, and the one
referenced above in the DictionaryBatch defined in Message.fbs [2] for
transmitting dictionaries.  It is quite possible I missed something.

The indices into the dictionary are integer arrays in a normal record batch.
I suppose the other option is to reset the stream by sending a new schema,
but I don't think that is supported either. This is what led to my
original question.

Does no one do this today?

I think Wes did some recent work on reading dictionaries in the C++ Parquet
code, and might have faced some of these issues; I'm not sure how
he dealt with it (I haven't gotten back to the Parquet code yet).

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72

On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <ja...@apache.org> wrote:

> Wow, you've shown how little I've thought about Arrow dictionaries for a
> while. I thought we had a dictionary id and a record-in-dictionary-id.
> Wouldn't that approach make more sense? Does no one do this today? (We
> frequently use compound values for this type of scenario...)
>
> On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> Reading data from two different parquet files sequentially with different
>> dictionaries for the same column.  This could be handled by re-encoding
>> data but that seems potentially sub-optimal.
>>
>> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>>> What situation are anticipating where you're going to be restating ids
>>> mid stream?
>>>
>>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <em...@gmail.com>
>>> wrote:
>>>
>>>> The IPC specification [1] defines behavior when isDelta on a
>>>> DictionaryBatch [2] is "true".  I might have missed it in the
>>>> specification, but I couldn't find the interpretation for what the
>>>> expected
>>>> behavior is when isDelta=false and and two  dictionary batches  with the
>>>> same ID are sent.
>>>>
>>>> It seems like there are two options:
>>>> 1.  Interpret the new dictionary batch as replacing the old one.
>>>> 2.  Regard this as an error condition.
>>>>
>>>> Based on the fact that in the "file format" dictionaries are allowed to
>>>> be
>>>> placed in any order relative to the record batches, I assume it is the
>>>> second, but just wanted to make sure.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>>>
>>>

Re: [Format] Semantics for dictionary batches in streams

Posted by Jacques Nadeau <ja...@apache.org>.
Wow, you've shown how little I've thought about Arrow dictionaries for a
while. I thought we had a dictionary id and a record-in-dictionary-id.
Wouldn't that approach make more sense? Does no one do this today? (We
frequently use compound values for this type of scenario...)

On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <em...@gmail.com>
wrote:

> Reading data from two different parquet files sequentially with different
> dictionaries for the same column.  This could be handled by re-encoding
> data but that seems potentially sub-optimal.
>
> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <ja...@apache.org>
> wrote:
>
>> What situation are anticipating where you're going to be restating ids
>> mid stream?
>>
>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <em...@gmail.com>
>> wrote:
>>
>>> The IPC specification [1] defines behavior when isDelta on a
>>> DictionaryBatch [2] is "true".  I might have missed it in the
>>> specification, but I couldn't find the interpretation for what the
>>> expected
>>> behavior is when isDelta=false and and two  dictionary batches  with the
>>> same ID are sent.
>>>
>>> It seems like there are two options:
>>> 1.  Interpret the new dictionary batch as replacing the old one.
>>> 2.  Regard this as an error condition.
>>>
>>> Based on the fact that in the "file format" dictionaries are allowed to
>>> be
>>> placed in any order relative to the record batches, I assume it is the
>>> second, but just wanted to make sure.
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://arrow.apache.org/docs/ipc.html
>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>>
>>

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
Reading data from two different Parquet files sequentially with different
dictionaries for the same column.  This could be handled by re-encoding the
data, but that seems potentially sub-optimal.
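The re-encoding alternative amounts to remapping each file's indices through a unified dictionary. A minimal sketch (plain Python; names hypothetical) of why this works but costs a pass over the data:

```python
def reencode(indices, old_dict, unified_dict):
    """Re-encode indices from `old_dict` into positions in `unified_dict`."""
    position = {value: i for i, value in enumerate(unified_dict)}
    return [position[old_dict[i]] for i in indices]

# Two files encode the same column with different dictionaries.
file1_dict, file2_dict = ["A", "B"], ["B", "C"]
unified = ["A", "B", "C"]  # union, preserving first-seen order

assert reencode([0, 1, 1], file1_dict, unified) == [0, 1, 1]
assert reencode([0, 1, 0], file2_dict, unified) == [1, 2, 1]
```

Allowing a replacement dictionary batch at the file boundary would avoid touching the index data at all, which is the motivation for option 1.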

On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <ja...@apache.org> wrote:

> What situation are anticipating where you're going to be restating ids mid
> stream?
>
> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
>> The IPC specification [1] defines behavior when isDelta on a
>> DictionaryBatch [2] is "true".  I might have missed it in the
>> specification, but I couldn't find the interpretation for what the
>> expected
>> behavior is when isDelta=false and and two  dictionary batches  with the
>> same ID are sent.
>>
>> It seems like there are two options:
>> 1.  Interpret the new dictionary batch as replacing the old one.
>> 2.  Regard this as an error condition.
>>
>> Based on the fact that in the "file format" dictionaries are allowed to be
>> placed in any order relative to the record batches, I assume it is the
>> second, but just wanted to make sure.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>
>

Re: [Format] Semantics for dictionary batches in streams

Posted by Jacques Nadeau <ja...@apache.org>.
What situation are you anticipating where you're going to be restating IDs
mid-stream?

On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <em...@gmail.com>
wrote:

> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true".  I might have missed it in the
> specification, but I couldn't find the interpretation for what the expected
> behavior is when isDelta=false and and two  dictionary batches  with the
> same ID are sent.
>
> It seems like there are two options:
> 1.  Interpret the new dictionary batch as replacing the old one.
> 2.  Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to be
> placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>

Re: [Format] Semantics for dictionary batches in streams

Posted by Micah Kornfield <em...@gmail.com>.
I should add that Option #1 above would be my preference, even though it
adds some complications (especially for the file format).

On Sat, Aug 10, 2019 at 12:12 AM Micah Kornfield <em...@gmail.com>
wrote:

> The IPC specification [1] defines behavior when isDelta on a
> DictionaryBatch [2] is "true".  I might have missed it in the
> specification, but I couldn't find the interpretation for what the expected
> behavior is when isDelta=false and and two  dictionary batches  with the
> same ID are sent.
>
> It seems like there are two options:
> 1.  Interpret the new dictionary batch as replacing the old one.
> 2.  Regard this as an error condition.
>
> Based on the fact that in the "file format" dictionaries are allowed to be
> placed in any order relative to the record batches, I assume it is the
> second, but just wanted to make sure.
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/ipc.html
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>