Posted to user@arrow.apache.org by Thomas Buhrmann <th...@gmail.com> on 2019/10/14 13:33:19 UTC

Batch writing/reading tables with varying dictionary (in v0.14.1)

Hi,
My use case involves processing large datasets in batches (of rows), each
batch resulting in a DataFrame that I'm serializing to a single file on
disk via RecordBatchStreamWriter (to end up with a file that can in turn be
read in batches). My problem is that some columns are pandas categorical
types, for which I can't know ahead of time all the possible categories.
And since the RecordBatchStreamWriter accepts only a single schema, I can't
seem to find a way to update the Arrow dictionary, or write a new schema
for each RecordBatch. This results in an invalid stream/file with
dictionary indices that don't match the schema. Is there currently a way to
do this using the high-level APIs? Or would I have to manually construct
the stream using each batch's schema etc.?
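
Roughly, the pattern is the following (a minimal sketch; the file name and
the batch_iterator() generator are just placeholders for my actual
pipeline):

import pandas as pd
import pyarrow as pa

with pa.OSFile("batches.arrow", "wb") as sink:   # placeholder output path
    writer = None
    for df in batch_iterator():   # placeholder: yields one DataFrame per batch
        # some columns in df are pandas categoricals, with categories that
        # differ from batch to batch
        batch = pa.RecordBatch.from_pandas(df, preserve_index=False)
        if writer is None:
            # the schema (and dictionary index type) is fixed from whatever
            # the first batch happens to contain
            writer = pa.RecordBatchStreamWriter(sink, batch.schema)
        # later batches may carry different dictionary values, but the writer
        # only ever saw the schema captured above, so the indices written
        # here no longer match the dictionary recorded in the stream
        writer.write_batch(batch)
    if writer is not None:
        writer.close()

Reading the result back batch by batch would then go through
pa.RecordBatchStreamReader on the other end.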

It seems that this may be related to the open issues in ARROW-3144
<https://issues.apache.org/jira/browse/ARROW-3144> (ARROW-5279
<https://issues.apache.org/jira/browse/ARROW-5279>, ARROW-5336
<https://issues.apache.org/jira/browse/ARROW-5336>) and the discussion in
PR-3165 <https://github.com/apache/arrow/pull/3165>, from which I
understand that this may be supported already when writing to parquet, but
not in IPC? Is there any other workaround I could use right now?

Many thanks,
T

Re: Batch writing/reading tables with varying dictionary (in v0.14.1)

Posted by Thomas Buhrmann <th...@gmail.com>.
Ok, thanks for letting me know! I assume the same holds for the file writer
class and will keep an eye on the thread...

On Mon, 14 Oct 2019 at 22:56, Wes McKinney <we...@gmail.com> wrote:

> hi Thomas,
>
> The stream writer class currently only supports a constant dictionary.
> The work in ARROW-3144 moved the dictionary out of the schema and into
> the DictionaryArray data structure, which is a prerequisite for allowing
> dictionaries to change within a stream.
>
> To support your use case, we either need dictionary deltas or
> dictionary replacements to be implemented. These are provided for in
> the format, but have not been implemented yet in C++.
>
> Note there's a mailing list thread on dev@ going on right now about
> finalizing low-level details of dictionary encoding in the columnar
> format specification
>
>
> https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
>
> I just opened https://issues.apache.org/jira/browse/ARROW-6883 since I
> didn't see another issue covering this
>
> - Wes
>
> On Mon, Oct 14, 2019 at 8:41 AM Thomas Buhrmann
> <th...@gmail.com> wrote:
> >
> > Hi,
> > My use case involves processing large datasets in batches (of rows),
> each batch resulting in a DataFrame that I'm serializing to a single file
> on disk via RecordBatchStreamWriter (to end up with a file that can in turn
> be read in batches). My problem is that some columns are pandas categorical
> types, for which I can't know ahead of time all the possible categories.
> And since the RecordBatchStreamWriter accepts only a single schema, I can't
> seem to find a way to update the Arrow dictionary, or write a new schema
> for each RecordBatch. This results in an invalid stream/file with
> dictionary indices that don't match the schema. Is there currently a way to
> do this using the high-level APIs? Or would I have to manually construct
> the stream using each batch's schema etc.?
> >
> > It seems that this may be related to the open issues in ARROW-3144
> (ARROW-5279, ARROW-5336) and the discussion in PR-3165, from which I
> understand that this may be supported already when writing to parquet, but
> not in IPC? Is there any other workaround I could use right now?
> >
> > Many thanks,
> > T
>

Re: Batch writing/reading tables with varying dictionary (in v0.14.1)

Posted by Wes McKinney <we...@gmail.com>.
hi Thomas,

The stream writer class currently only supports a constant dictionary.
The work in ARROW-3144 moved the dictionary out of the schema and into
the DictionaryArray data structure, which is a prerequisite for allowing
dictionaries to change within a stream.
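
You can see the effect of that change from Python, where the values now
travel with the DictionaryArray rather than with the type (a small
illustration, assuming pyarrow 0.14.x):

import pandas as pd
import pyarrow as pa

arr = pa.array(pd.Categorical(["a", "b", "a"]))
print(arr.type)        # dictionary type only; no values stored in the type
print(arr.dictionary)  # ["a", "b"]: the values hang off the array itself
print(arr.indices)     # [0, 1, 0]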

To support your use case, we either need dictionary deltas or
dictionary replacements to be implemented. These are provided for in
the format, but have not been implemented yet in C++.
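
In the meantime, one possible workaround (only a sketch, and only viable if
you can live without dictionary encoding on disk) is to cast categorical
columns back to plain values before writing, so the stream carries ordinary
string arrays and no dictionaries at all:

import pandas as pd
import pyarrow as pa

def decategorize(df):
    # cast categorical columns back to plain object columns so that the
    # resulting RecordBatch contains ordinary string arrays instead of
    # dictionary-encoded ones
    out = df.copy()
    for name in out.columns:
        if pd.api.types.is_categorical_dtype(out[name]):
            out[name] = out[name].astype(object)
    return out

df = pd.DataFrame({"label": pd.Categorical(["a", "b", "a"])})
batch = pa.RecordBatch.from_pandas(decategorize(df), preserve_index=False)

The other direction is to make the dictionary genuinely constant, e.g. by
doing a first pass over the data to collect every category and fixing the
categories on each batch before conversion, at the cost of an extra pass
over the data.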

Note there's a mailing list thread on dev@ going on right now about
finalizing low-level details of dictionary encoding in the columnar
format specification

https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E

I just opened https://issues.apache.org/jira/browse/ARROW-6883 since I
didn't see another issue covering this

- Wes

On Mon, Oct 14, 2019 at 8:41 AM Thomas Buhrmann
<th...@gmail.com> wrote:
>
> Hi,
> My use case involves processing large datasets in batches (of rows), each batch resulting in a DataFrame that I'm serializing to a single file on disk via RecordBatchStreamWriter (to end up with a file that can in turn be read in batches). My problem is that some columns are pandas categorical types, for which I can't know ahead of time all the possible categories. And since the RecordBatchStreamWriter accepts only a single schema, I can't seem to find a way to update the Arrow dictionary, or write a new schema for each RecordBatch. This results in an invalid stream/file with dictionary indices that don't match the schema. Is there currently a way to do this using the high-level APIs? Or would I have to manually construct the stream using each batch's schema etc.?
>
> It seems that this may be related to the open issues in ARROW-3144 (ARROW-5279, ARROW-5336) and the discussion in PR-3165, from which I understand that this may be supported already when writing to parquet, but not in IPC? Is there any other workaround I could use right now?
>
> Many thanks,
> T