Posted to dev@arrow.apache.org by Hatem Helal <Ha...@mathworks.co.uk> on 2019/01/24 15:59:13 UTC

Round-trip of categorical data with Arrow and Parquet

Hi everyone,

I wanted to gauge interest and feasibility for adding support for natively reading an arrow::DictionaryArray from a Parquet file.  Currently, an arrow::DictionaryArray that is written to a Parquet file is read back as its native index type [0].  I came across a prior discussion of this problem in the context of pandas [1], but I think this would be useful for other Arrow clients (C++ or otherwise).

The solution I had in mind would be to add Arrow type information as column metadata.  This metadata would then be used when reading the Parquet file back to determine which Arrow type to create for the column data.
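
To make the idea concrete, here is a rough C++ sketch (the metadata key
"ARROW:type" and its value encoding are invented for illustration; the
real key and serialization would need to be agreed on):

    #include <arrow/api.h>

    #include <memory>
    #include <vector>

    // Tag a column's field with a marker recording its logical Arrow
    // type, so a reader could rebuild the column as a DictionaryArray.
    std::shared_ptr<arrow::Schema> TagAsDictionary(
        const std::shared_ptr<arrow::Schema>& schema, int column) {
      auto md = arrow::key_value_metadata({"ARROW:type"}, {"dictionary"});
      std::vector<std::shared_ptr<arrow::Field>> fields = schema->fields();
      fields[column] = fields[column]->WithMetadata(md);
      return arrow::schema(fields, schema->metadata());
    }

    // On the read side (hypothetical), the reader would check the field
    // metadata for the "ARROW:type" key and, if present, route the
    // column's values through a dictionary builder.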

I’m willing to contribute this feature but first wanted to get some feedback on whether this would be generally useful and if the high-level proposed solution would make sense.

Thanks!

Hatem


[0] This test demonstrates this behavior
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
[1] https://github.com/apache/arrow/issues/1688

Re: Round-trip of categorical data with Arrow and Parquet

Posted by Wes McKinney <we...@gmail.com>.
I implemented the first stage of this

https://github.com/apache/arrow/pull/3492

Basically, this allows column data decoders to write directly into an
Arrow builder class, including BinaryDictionaryBuilder.

So once that's merged, the remaining steps are:

* Define a public API for requesting that a particular field be
returned as a DictionaryArray (sketched below)
* In RecordReader, toggle between
arrow::internal::ChunkedBinaryBuilder and
arrow::BinaryDictionaryBuilder based on the option for that column
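
One possible shape for that opt-in, just to illustrate (the type and
method names here are hypothetical and would be settled in review):

    // Hypothetical sketch of the public opt-in, declarations only:
    struct ArrowReaderProperties {
      // Request that column i be decoded directly into an
      // arrow::DictionaryArray rather than a dense array.
      void set_read_dictionary(int i, bool read_dict);
      bool read_dictionary(int i) const;
    };

    // Usage, roughly:
    //
    //   ArrowReaderProperties props;
    //   props.set_read_dictionary(0, true);
    //   // Open the parquet::arrow::FileReader with these properties;
    //   // reading column 0 then yields dictionary<utf8, int32> chunks
    //   // instead of dense utf8.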

If we want to exactly preserve the dictionary values (for data that
originated in a DictionaryArray), more work is needed (and we need to
do this in order to faithfully round trip pandas.Categorical types).
The way I would do it:

* Add an API to BinaryDictionaryBuilder so that the hash table can be
"initialized" with the exact dictionary from the Parquet row group.
This method should return an indicator of whether the encoded
dictionary indices can be inserted directly, skipping hashing (this
will yield significantly better performance)
* Add an API to BinaryDictionaryBuilder that enables bulk-insert of
dictionary indices, bypassing hash logic (both sketched below)
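
Roughly, those two additions might look like this (signatures are
illustrative only, not a final API):

    #include <arrow/api.h>

    // Hypothetical additions to arrow::BinaryDictionaryBuilder,
    // declarations only:
    class BinaryDictionaryBuilder {
     public:
      // Seed the builder's hash table with the dictionary page from a
      // Parquet row group. On return, *compatible says whether the
      // builder's existing dictionary is equal to (or a prefix of) the
      // inserted one, in which case encoded indices can be appended
      // as-is, skipping hashing.
      arrow::Status InsertMemoValues(const arrow::Array& dictionary,
                                     bool* compatible);

      // Bulk-append already-encoded dictionary indices, bypassing the
      // hash table entirely.
      arrow::Status AppendIndices(const int32_t* indices, int64_t length);
    };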

So in subsequent row groups there are two cases to consider:

* Dictionary the same, or the prior dictionary is a "prefix" of the
new one: the "initialization" method will verify this, so you can use
the second API to bulk-insert the indices
* Dictionaries different: two possibilities. Either hash the values,
or compute the permutation from the row group dictionary to the
accumulated one and remap the indices through it (see the sketch
after the example below)
So on that first thing, what I mean is that if we have

Row group 1, dictionary ['A', 'B', 'C']
Row group 2, dictionary ['A', 'B', 'C', 'D']

Then both row groups can have their indices bulk-appended without having to hash
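
And in the differing-dictionaries case, the remap amounts to something
like this (plain C++, illustration only):

    #include <cstdint>

    #include <vector>

    // Remap row-group-local indices through a precomputed permutation
    // (position in the row group's dictionary -> position in the
    // builder's accumulated dictionary).
    std::vector<int32_t> RemapIndices(
        const std::vector<int32_t>& indices,
        const std::vector<int32_t>& permutation) {
      std::vector<int32_t> out(indices.size());
      for (std::size_t i = 0; i < indices.size(); ++i) {
        out[i] = permutation[indices[i]];
      }
      return out;
    }

    // e.g. a row group dictionary ['B', 'A', 'C'] against an
    // accumulated dictionary ['A', 'B', 'C'] gives permutation
    // {1, 0, 2}, so indices {0, 0, 1, 2} remap to {1, 1, 0, 2}.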

thanks
Wes

Re: Round-trip of categorical data with Arrow and Parquet

Posted by Wes McKinney <we...@gmail.com>.
I'm undertaking some refactoring to expose some of the decoding
internals in a more direct way to the Arrow builder classes, see
https://issues.apache.org/jira/browse/PARQUET-1508.

I'll try to have a patch up sometime today or over the weekend for you
to review (WIP:
https://github.com/wesm/arrow/tree/parquet-decode-into-arrow-builder)

After that we should be able to alter the Arrow read path to read
directly into an arrow::BinaryDictionaryBuilder.



Re: Round-trip of categorical data with Arrow and Parquet

Posted by Hatem Helal <Ha...@mathworks.co.uk>.
Thanks Wes,

Glad to hear this is in your plan.

I probably should have done this earlier...but here are some JIRA tickets that seem to cover this:

https://issues.apache.org/jira/browse/ARROW-3772
https://issues.apache.org/jira/browse/ARROW-3325
https://issues.apache.org/jira/browse/ARROW-3769



Re: Round-trip of categorical data with Arrow and Parquet

Posted by Wes McKinney <we...@gmail.com>.
hi Hatem,

There are several issues open about this already (I'll have to dig
them up), so this is something that we have desired for a long time,
but have not gotten around to implementing.

Since many Parquet writers use dictionary encoding, it would make the
most sense to have an option to return DictionaryArray (which can be
converted to pandas.Categorical) from any column, and internally we
will perform the conversion from the encoded Parquet format as
efficiently as possible.
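
To make the target concrete, this is roughly what building such a
column looks like with the dictionary builders Arrow already has
(illustration only; error handling kept minimal):

    #include <arrow/api.h>

    #include <memory>

    // A DictionaryArray is a dictionary of unique values plus integer
    // indices into it; StringDictionaryBuilder assembles both.
    arrow::Status MakeExample(std::shared_ptr<arrow::Array>* out) {
      arrow::StringDictionaryBuilder builder;
      ARROW_RETURN_NOT_OK(builder.Append("low"));
      ARROW_RETURN_NOT_OK(builder.Append("high"));
      ARROW_RETURN_NOT_OK(builder.Append("low"));  // hashes to entry 0
      ARROW_RETURN_NOT_OK(builder.Finish(out));
      // *out is a DictionaryArray with dictionary ["low", "high"] and
      // indices [0, 1, 0].
      return arrow::Status::OK();
    }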

There are many cases to consider:

* Dictionary encoded, but different dictionaries in each row group
(this is actually the most likely scenario)
* Dictionary encoded, but the same dictionary in all row groups
* PLAIN encoded data that we pass through DictionaryBuilder as it is
decoded to yield DictionaryArray
* Dictionary encoded, but switch over to PLAIN encoding mid-stream

Having column metadata that automatically "opts in" to the
DictionaryArray conversion sounds reasonable for usability (so long
as Arrow readers have a way to opt out, probably via a global flag
to ignore such custom metadata fields).

Part of the reason this work was not done in the past was because some
of our hash table machinery was a bit immature. Antoine has recently
improved things significantly, so it should be a lot easier now to do
this work. This is quite a large project, though, and one that affects
a _lot_ of users, so I would be willing to take an initial pass on the
implementation.

Along with completing the nested data read/write path I would say this
is the 2nd highest priority project in parquet-cpp for Arrow users.

- Wes
