Posted to user@arrow.apache.org by Chris Nuernberger <ch...@techascent.com> on 2022/02/22 15:06:14 UTC

Dictionaries and multiple record batches

How are dictionaries intended to be used in a file with multiple record
batches?

I tried saving record-batch-specific dictionaries and got this error from
Python:

 > pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file

This seems to defeat the purpose of having multiple record batches in a
single Arrow file; the workaround appears to be either to preprocess the
entire sequence of datasets to unify the dictionaries or to save multiple
Arrow files.
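
[Editor's sketch] The "unify the dictionaries" workaround mentioned above can be illustrated in plain Python. This is only a minimal sketch under stated assumptions: each batch is modeled as a (dictionary values, indices) pair, and the function name unify_dictionaries is hypothetical, not a pyarrow API. Nulls are not handled.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Assumption: a batch is modeled as (dictionary values, indices into that dictionary).
Batch = Tuple[List[Hashable], List[int]]


def unify_dictionaries(batches: Sequence[Batch]) -> Tuple[List[Hashable], List[List[int]]]:
    """Merge per-batch dictionaries into one and remap each batch's indices."""
    unified: List[Hashable] = []
    position: Dict[Hashable, int] = {}  # value -> index in the unified dictionary
    remapped: List[List[int]] = []
    for dictionary, indices in batches:
        # Build the translation table: local index -> unified index.
        local_to_unified: List[int] = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            local_to_unified.append(position[value])
        remapped.append([local_to_unified[i] for i in indices])
    return unified, remapped


batches = [
    (["a", "b"], [0, 1, 0]),  # decodes to a, b, a
    (["c", "a"], [0, 1, 1]),  # decodes to c, a, a
]
unified, remapped = unify_dictionaries(batches)
# unified  -> ["a", "b", "c"]
# remapped -> [[0, 1, 0], [2, 0, 0]]
```

The cost is the one the post complains about: every batch has to be seen (or at least its dictionary has to be) before the first one can be written with the shared dictionary.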

Re: Dictionaries and multiple record batches

Posted by Chris Nuernberger <ch...@techascent.com>.
If you are going to read all the dictionary blocks before reading any
record batch anyway, there is surely a way to make this work now without
changing the file format itself. I think, however, that if what is there
currently works, there is no meaningful advantage in adding whatever it
would take to make replacement dictionaries work.


Re: Dictionaries and multiple record batches

Posted by Micah Kornfield <em...@gmail.com>.
>
> I guess since the keys are only additive then you just create the master
> dictionary before allowing random access to the data.


Yes, this is what the implementation does.

At some point we might want to create an updated file format that can
handle replacements also, but this hasn't been a priority for anyone.


Re: Dictionaries and multiple record batches

Posted by Chris Nuernberger <ch...@techascent.com>.
I guess since the keys are only additive then you just create the master
dictionary before allowing random access to the data.
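
[Editor's sketch] The idea of building the master dictionary first can be shown in a few lines of plain Python. This is a toy model, not the actual Arrow reader code: the names build_master_dictionary and decode_batch are hypothetical, and dictionaries are modeled as lists of strings.

```python
from typing import List, Sequence


def build_master_dictionary(initial: Sequence[str],
                            deltas: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate the initial dictionary with every delta, in file order."""
    master = list(initial)
    for delta in deltas:
        # Deltas only append entries, so indices handed out earlier stay valid.
        master.extend(delta)
    return master


def decode_batch(indices: Sequence[int], master: Sequence[str]) -> List[str]:
    """Look up each index of a batch in the fully assembled dictionary."""
    return [master[i] for i in indices]


initial = ["red", "green"]
deltas = [["blue"], ["yellow", "black"]]
batch_indices = [[0, 1, 1], [2, 0], [4, 3, 1]]

master = build_master_dictionary(initial, deltas)
# Any batch can now be decoded in any order against the same dictionary.
decoded_last = decode_batch(batch_indices[2], master)   # black, yellow, green
decoded_first = decode_batch(batch_indices[0], master)  # red, green, green
```

Because entries are only ever appended, an index written for an early batch means the same thing after all deltas are applied, which is why the order in which batches are read does not matter.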


Re: Dictionaries and multiple record batches

Posted by Chris Nuernberger <ch...@techascent.com>.
OK, thanks, I will work with delta dictionaries.

How do delta dictionaries solve the random access issue?


Dictionaries and multiple record batches

Posted by Micah Kornfield <em...@gmail.com>.
Dictionary replacement isn't supported in the file format because the
metadata makes it difficult to associate a particular dictionary with a
record batch for random access.

Delta dictionaries are supported, but there was a long-standing bug that
prevented their use in Python
(https://issues.apache.org/jira/browse/ARROW-13467). If you are still
seeing issues in pyarrow 7.0, please open a bug.
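
[Editor's sketch] On the writer side, a delta dictionary carries only the entries that have not been seen in earlier batches. The toy encoder below illustrates that in plain Python; the class name DeltaDictionaryEncoder is hypothetical and this is not how pyarrow is implemented or called.

```python
from typing import Dict, List, Sequence, Tuple


class DeltaDictionaryEncoder:
    """Toy encoder: keeps one growing dictionary across batches."""

    def __init__(self) -> None:
        self._position: Dict[str, int] = {}  # value -> index in the running dictionary

    def encode(self, values: Sequence[str]) -> Tuple[List[str], List[int]]:
        """Return (delta, indices): new dictionary entries plus the batch's indices."""
        delta: List[str] = []
        indices: List[int] = []
        for value in values:
            index = self._position.get(value)
            if index is None:
                # First time this value appears: append it to the dictionary.
                index = len(self._position)
                self._position[value] = index
                delta.append(value)
            indices.append(index)
        return delta, indices


encoder = DeltaDictionaryEncoder()
first = encoder.encode(["x", "y", "x"])  # delta holds x and y
second = encoder.encode(["y", "z"])      # delta holds only z
third = encoder.encode(["x", "z"])       # nothing new, so the delta is empty
```

A replacement, by contrast, would reassign existing indices, so a reader jumping straight to a batch would need to know which dictionary was in effect at that point.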

As for the usefulness of the file format without these features, that
is really use-case dependent.

Cheers,
Micah
