Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2019/08/02 15:55:02 UTC

[C++][Python] Direct Arrow DictionaryArray reads from Parquet files

I've been working (with Hatem Helal's assistance!) over the last few
months to put the pieces in place to enable reading BYTE_ARRAY columns
in Parquet files directly to Arrow DictionaryArray. As context, it's
not uncommon for a Parquet file to occupy ~100x less space on disk
(sometimes an even greater compression factor) than the fully decoded
data in memory when there are a lot of repeated strings. Users
sometimes get frustrated when they read a "small" Parquet file and run
into memory problems.

I made a benchmark to exhibit an example "worst-case scenario":

https://gist.github.com/wesm/450d85e52844aee685c0680111cbb1d7

In this example, we have a table with a single column containing 10
million values drawn from a dictionary of 1000 values that is about 50
kilobytes in size. Written to Parquet, the file is a little over 1
megabyte thanks to Parquet's layers of compression. But read naively
into an Arrow BinaryArray, about 500MB of memory is taken up (10M
values * 54 bytes per value). With the new decoding machinery, we can
skip the dense decoding of the binary data and append the Parquet
file's internal dictionary indices directly into an
arrow::DictionaryBuilder, yielding a DictionaryArray at the end. The
end result uses less than 10% as much memory (about 40MB compared with
500MB) and is almost 20x faster to decode.
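
To make the memory arithmetic concrete: the dictionary-encoded result
needs roughly 10M 4-byte indices (~40MB) plus the ~50KB dictionary,
versus roughly 10M * 54 bytes (~540MB) of dense string data. The
sketch below is a hedged approximation of this scenario with pyarrow
(the gist above has the actual benchmark); the file and column names
are made up, and the read_dictionary option requires a pyarrow build
that includes this new decoding path.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# ~1000 distinct 54-byte strings: a dictionary of roughly 50 kilobytes.
dictionary = ['x' * 48 + '%06d' % i for i in range(1000)]

# 10 million values drawn from that dictionary (the list only holds
# references to the 1000 shared Python strings).
indices = np.random.randint(0, 1000, size=10_000_000)
values = [dictionary[i] for i in indices]

table = pa.Table.from_pydict({'f0': values})
pq.write_table(table, 'dict_example.parquet')
del table, values  # free the in-memory copies before measuring the reads

# Dense read: ~10M * 54 bytes of string data, roughly 500MB.
dense = pq.read_table('dict_example.parquet')
print('dense:', dense.column('f0').type, pa.total_allocated_bytes())
del dense

# Dictionary read: ~10M int32 indices (~40MB) plus the ~50KB dictionary.
encoded = pq.read_table('dict_example.parquet', read_dictionary=['f0'])
print('dict: ', encoded.column('f0').type, pa.total_allocated_bytes())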

The PR finally making this available in Python is here:
https://github.com/apache/arrow/pull/4999

Complex, multi-layered projects like this can be a little
inscrutable when discussed strictly at a code/technical level, but I
hope this helps show that employing dictionary encoding can have a lot
of user impact on both memory use and performance.

- Wes

Re: [C++][Python] Direct Arrow DictionaryArray reads from Parquet files

Posted by 俊杰陈 <cj...@gmail.com>.
I remember encountering an OOM when using Spark broadcast with AE
enabled, which seems like the same issue here: AE takes the
compressed data size into consideration but decompresses the data when
broadcasting, and thus throws an OOM.

-- 
Thanks & Best Regards

Re: [C++][Python] Direct Arrow DictionaryArray reads from Parquet files

Posted by Wes McKinney <we...@gmail.com>.
hi Hatem -- I was planning to look at the round-trip question here
early this week. Since I have all the code fresh in my mind, let me
have a look and I'll report back.

Accurately preserving the original dictionary in a round trip is
tricky because the low-level column writer doesn't expose any detail
about what it's doing with each batch of data. At minimum, if we
automatically set "read_dictionary" based on what the original schema
was, then that gets us part of the way there.
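
As a very rough sketch of that "part of the way there" idea (and not
the eventual implementation), a helper on the Python side could look
something like the following; the helper name is hypothetical, and
since whether dictionary types survive the round trip is exactly the
open question here, it falls back to a caller-supplied column list.

import pyarrow as pa
import pyarrow.parquet as pq

def read_table_with_dictionaries(path, dictionary_columns=None):
    # Hypothetical helper, not part of pyarrow: choose which columns to
    # read as DictionaryArray, either from an explicit list or by
    # looking for dictionary-typed fields in the schema recovered from
    # the file.
    if dictionary_columns is None:
        schema = pq.read_schema(path)
        dictionary_columns = [
            field.name for field in schema
            if pa.types.is_dictionary(field.type)
        ]
    return pq.read_table(path, read_dictionary=dictionary_columns)

# If the recovered schema does not preserve dictionary types (the open
# question above), name the columns explicitly:
# table = read_table_with_dictionaries('data.parquet', ['f0'])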

- Wes

Re: [C++][Python] Direct Arrow DictionaryArray reads from Parquet files

Posted by Hatem Helal <hh...@mathworks.com>.
Thanks for sharing this very illustrative benchmark.  It's really nice to see the huge benefit for languages that have a type for modelling categorical data.

I'm interested in whether we can make the parquet/arrow integration automatically handle the round trip for Arrow DictionaryArrays.  We've had this requested by users of the MATLAB-Parquet integration.  We've suggested workarounds for those users, but as your benchmark shows, you need to have enough memory to store the "dense" representation.  I think this could be solved by writing metadata with the Arrow data type.  An added benefit of doing this at the Arrow level is that any language that uses the C++ parquet/arrow integration could round-trip DictionaryArrays.

I'm not currently sure how all the pieces would fit together, but let me know if there is interest and I'm happy to flesh this out as a PR.
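
To make the metadata idea a bit more concrete, here is one
hypothetical shape it could take on the Python side. The
'dictionary_columns' key and both helpers are invented for
illustration and are not part of the parquet/arrow integration: record
which columns were dictionary-encoded as schema-level key-value
metadata at write time, then use that metadata to set read_dictionary
at read time.

import json
import pyarrow as pa
import pyarrow.parquet as pq

def write_with_dictionary_metadata(table, path):
    # Record which columns are dictionary-encoded as schema-level
    # key-value metadata, which is persisted in the Parquet file footer.
    dict_cols = [
        field.name for field in table.schema
        if pa.types.is_dictionary(field.type)
    ]
    existing = table.schema.metadata or {}
    annotated = table.replace_schema_metadata(
        {**existing, b'dictionary_columns': json.dumps(dict_cols).encode()})
    pq.write_table(annotated, path)

def read_with_dictionary_metadata(path):
    # Use the stored metadata to decide which columns to read back as
    # DictionaryArray via the read_dictionary option.
    metadata = pq.read_schema(path).metadata or {}
    dict_cols = json.loads(metadata.get(b'dictionary_columns', b'[]'))
    return pq.read_table(path, read_dictionary=dict_cols)

A variant of the same idea would be to store the full serialized Arrow
schema in the file metadata, which would cover other type information
beyond dictionaries.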

