Posted to dev@arrow.apache.org by Sarah Gilmore <sg...@mathworks.com> on 2021/11/15 20:22:19 UTC

[Parquet][C++][Python] Maximum Row Group Length Default

Hi all,

I was wondering if anyone could elaborate on why the default maximum row group length is set to 67108864<https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>. From Apache Parquet's documentation, the recommended row group size is between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/> For a Float64Array whose length is 67108864, I believe its size would be approximately 545 MB, which is on the low end of that interval.

I was wondering if there was a particular reason why 67108864 was chosen as the maximum row group length. I experimented with setting the default maximum row group length to larger values and noticed pyarrow cannot import Parquet files containing row groups whose lengths exceed 2147483647 rows (int32 max). However, I was able to read these files in using the C++ Arrow bindings.
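
For reference, the arithmetic behind that estimate (just a back-of-envelope in Python: 8 bytes per value plus a 1-bit validity bitmap; the on-disk size will differ once encoding and compression are applied):

    # in-memory size of a Float64Array at the default maximum row group length
    n = 67108864            # default max_row_group_length
    data_bytes = n * 8      # 536,870,912 bytes of float64 values
    bitmap_bytes = n // 8   #   8,388,608 bytes of validity bitmap
    print(data_bytes + bitmap_bytes)   # 545,259,520 bytes, roughly 545 MB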


Best,
Sarah



Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Weston Pace <we...@gmail.com>.
Somewhat, though maybe not as badly.  The Arrow format only lists the
schema once, and the per-batch metadata is mostly just buffer lengths.
For disk size I ran an experiment on 100k rows x 10k columns of
float64 and got:

-rw-rw-r--  1 pace pace  8005840602 Nov 22 09:35 10_batches.arrow
-rw-rw-r--  1 pace pace  8049051402 Nov 22 09:40 100_batches.arrow
-rw-rw-r--  1 pace pace  8481159402 Nov 22 09:51 1000_batches.arrow
-rw-rw-r--  1 pace pace  9786751894 Nov 22 09:52 10_batches.parquet
-rw-rw-r--  1 pace pace  9577577138 Nov 22 09:53 100_batches.parquet
-rw-rw-r--  1 pace pace 12107949064 Nov 22 10:12 1000_batches.parquet

Note, this is something of a pathological case for Parquet, as random
float data is incompressible, so don't focus too much on Arrow vs.
Parquet.  I'm trying to illustrate the effect of having more record
batches.  I'm a little surprised by the difference between 10 and 100
batches in Parquet, but maybe someone else can guess why.

Also, note that the row groups don't have to be super large.  Between
10 & 100 the effect is pretty small compared to the size of the data.
It's not really noticeable until you hit 1000 batches (which means 100
rows per batch).
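
In case anyone wants to reproduce something similar, the write side was
roughly the following (a pyarrow sketch rather than the exact script;
the full 100k x 10k table needs ~8 GB of RAM, so shrink the dimensions
to try it locally):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    rows, cols, n_batches = 100_000, 10_000, 10
    table = pa.table({f"f{i}": np.random.rand(rows) for i in range(cols)})

    # Arrow IPC file written as n_batches record batches
    batches = table.to_batches(max_chunksize=rows // n_batches)
    with pa.OSFile(f"{n_batches}_batches.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            for batch in batches:
                writer.write_batch(batch)

    # Parquet file written as n_batches row groups
    pq.write_table(table, f"{n_batches}_batches.parquet",
                   row_group_size=rows // n_batches)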

Size on disk is only half the problem.  The other question is
processing time.  That's a little harder to measure, and it's
something that is always improving or has the potential to improve.
For example, I timed reading a single column from those files into
memory with the datasets API.  For each file I recorded the time of the
second read, so these are cached-in-memory reads with little to no I/O
time.

IPC:
10 batches: 2.2 seconds
100 batches: 2.6 seconds
1000 batches: 4.9 seconds

Parquet:
10 batches: 0.25 seconds
100 batches: 1.75 seconds
1000 batches: 16.87 seconds

Again, don't worry too much about IPC vs. Parquet.  These results are
from 6.0.1 and I'm not using memory-mapped files.  The performance in
7.0.0 will probably be better for this test because of better IPC
support for column projection pushdown (thanks Yue Ni!).
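
The measurement itself was along these lines (again a sketch; the
column name just matches the write sketch above):

    import time
    import pyarrow.dataset as ds

    def time_second_read(path, fmt, column):
        dataset = ds.dataset(path, format=fmt)
        dataset.to_table(columns=[column])   # first read to warm the OS cache
        start = time.perf_counter()
        dataset.to_table(columns=[column])   # timed second (warm) read
        return time.perf_counter() - start

    print(time_second_read("1000_batches.parquet", "parquet", "f0"))
    print(time_second_read("1000_batches.arrow", "ipc", "f0"))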

On Mon, Nov 22, 2021 at 8:17 AM Aldrin <ak...@ucsc.edu.invalid> wrote:
>
> Hi Weston,
>
> This is slightly off-topic, but I'm curious if what you mentioned about the
> large metadata blocks (inlined below) also applies to IPC format?
>
> I am working with matrices and representing them as tables that can have
> hundreds of thousands of columns, but I'm splitting them into row groups to
> apply push down predicates.
>
> > Finally, one other issue that comes into play is the width of your
> > data.  Really wide datasets (e.g. tens of thousands of columns) suffer
> > from having rather large metadata blocks.  If your row groups start to
> > get small then you end up spending a lot of time parsing metadata and
> > much less time actually reading data.
>
>
>
> Thanks!
>
> > --
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Aldrin <ak...@ucsc.edu.INVALID>.
Hi Weston,

This is slightly off-topic, but I'm curious if what you mentioned about the
large metadata blocks (inlined below) also applies to IPC format?

I am working with matrices and representing them as tables that can have
hundreds of thousands of columns, but I'm splitting them into row groups to
apply push down predicates.

> Finally, one other issue that comes into play is the width of your
> data.  Really wide datasets (e.g. tens of thousands of columns) suffer
> from having rather large metadata blocks.  If your row groups start to
> get small then you end up spending a lot of time parsing metadata and
> much less time actually reading data.



Thanks!

> --

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Weston Pace <we...@gmail.com>.
> What are the tradeoffs between a low and a large row group size?

I can give some perspective from the C++ work I've been doing.  I
believe this inspired some of the recommendations Jon is referring to.
At the moment we have a number of limitations that aren't limitations
in the format, but rather limitations in the Arrow C++ library and in
how we consume the data.

The most significant issue is memory pressure, especially in the C++
dataset scanner.  You can see an example here[1].  This isn't entirely
inevitable though, even if row groups are large.  The main problem
there is that the C++ dataset scanner implements readahead on a
"readahead X # of row groups" basis and not on a "readahead X # of
bytes" basis.  I have a ticket in place[2] which will hopefully improve the
dataset scanner's ability to work with files that have large row
groups, but I'm not actively working on it.  That should generally
allow for larger row groups without blowing up RAM when running
queries through the C++ compute layer.
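
Independent of those tickets, the streaming APIs at least avoid
materializing the whole result when consuming from Python (a sketch;
the path and column name here are made up):

    import pyarrow.dataset as ds

    dataset = ds.dataset("s3://my-bucket/my-dataset/", format="parquet")
    # Iterate record batches instead of calling to_table(), so each batch
    # can be processed and released before the next one is decoded.
    for batch in dataset.to_batches(columns=["f0"], batch_size=64 * 1024):
        ...  # process one RecordBatch at a time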

In addition, we may investigate better mechanisms for "sliced reads"
of both parquet and IPC.  For example, the Arrow C++ parquet reader
can only read an entire row group at a time.  When my input is S3 I am
often scanning multiple batches at once.  If a row group is 1GB and I
am scanning ten files in parallel then I end up with 10GB of "cached
data" that I then slowly scan through.  This is wasteful from a RAM
perspective if I'm trying to do streaming processing.  Parquet could
support reads at data-page resolution (the underlying parquet-cpp lib
may actually have this, I haven't looked too deeply yet) which would
solve this problem.  Also, the IPC format could support a similar
concept at any desired resolution.  Some work could be done in both
cases to make sure we only read / decode the metadata once.
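
For reference, from Python the finest-grained read you can easily do
today is one row group for a subset of columns, e.g. (sketch; the file
and column names are just examples):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("10_batches.parquet")
    print(pf.metadata.num_row_groups)
    # decodes the whole row group for that column, even if only a
    # slice of it is actually needed
    rg0 = pf.read_row_group(0, columns=["f0"])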

Another way to solve the above problem is to rely on more parallel
reads within a single record batch and deprioritize reading multiple
batches or multiple files at once.  But then that can get kind of
tricky to manage if you have a large number of small files.

> Is it that a low value allows for quicker random access (as we can seek row
> groups based on the number of rows they have), while a larger value allows
> for higher dict-encoding and compression ratios?

Similar to this is the idea of push-down predicates.  If the row
groups are smaller, then there is a greater chance that you can skip
entire row groups based on the statistics.  This only applies to
parquet since we don't have any group statistics in the IPC format.
Another way to tackle this problem would be to rely more on data page
statistics.  However, the C++ Arrow parquet reader doesn't support
using page-level statistics for push down predicates.
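
The row-group statistics that make this skipping possible are visible
from Python, which also shows why smaller row groups give a reader more
chances to skip (sketch; file name is just an example):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("10_batches.parquet").metadata
    for i in range(md.num_row_groups):
        stats = md.row_group(i).column(0).statistics
        if stats is not None and stats.has_min_max:
            # a reader can skip any row group whose [min, max] range
            # cannot satisfy the filter
            print(i, stats.min, stats.max)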

Finally, one other issue that comes into play is the width of your
data.  Really wide datasets (e.g. tens of thousands of columns) suffer
from having rather large metadata blocks.  If your row groups start to
get small then you end up spending a lot of time parsing metadata and
much less time actually reading data.

[1] https://issues.apache.org/jira/browse/ARROW-14736
[2] https://issues.apache.org/jira/browse/ARROW-14648
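
On the wide-metadata point: from Python you can at least see how big
the footer metadata is getting relative to the data (sketch; file name
is just an example):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("10_batches.parquet").metadata
    # Thrift footer size in bytes; it grows roughly with
    # num_columns * num_row_groups
    print(md.num_columns, md.num_row_groups, md.serialized_size)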

On Wed, Nov 17, 2021 at 10:22 AM Jorge Cardoso Leitão
<jo...@gmail.com> wrote:
>
> What are the tradeoffs between a low and a large row group size?
>
> Is it that a low value allows for quicker random access (as we can seek row
> groups based on the number of rows they have), while a larger value allows
> for higher dict-encoding and compression ratios?
>
> Best,
> Jorge
>
>
>
>
> On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jk...@gmail.com> wrote:
>
> > This doesn't address the large number of row groups ticket that was
> > raised, but for some visibility: there is some work to change the row
> > group sizing based on the size of data instead of a static number of
> > rows [1] as well as exposing a few more knobs to tune [2]
> >
> > There is a bit of prior art in the R implementation for attempting to
> > get a reasonable row group size based on the shape of the data
> > (basically, aims to have row groups that have 250 Million cells in
> > them). [3]
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4542
> > [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> > https://issues.apache.org/jira/browse/ARROW-14427
> > [3]
> > https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
> >
> > -Jon
> >
> > On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> > <jo...@gmail.com> wrote:
> > >
> > > In addition, would it be useful to be able to change this
> > max_row_group_length
> > > from Python?
> > > Currently that writer property can't be changed from Python, you can only
> > > specify the row_group_size (chunk_size in C++)
> > > when writing a table, but that's currently only useful to set it to
> > > something that is smaller than the max_row_group_length.
> > >
> > > Joris
> >

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
What are the tradeoffs between a low and a large row group size?

Is it that a low value allows for quicker random access (as we can seek row
groups based on the number of rows they have), while a larger value allows
for higher dict-encoding and compression ratios?

Best,
Jorge




On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jk...@gmail.com> wrote:

> This doesn't address the large number of row groups ticket that was
> raised, but for some visibility: there is some work to change the row
> group sizing based on the size of data instead of a static number of
> rows [1] as well as exposing a few more knobs to tune [2]
>
> There is a bit of prior art in the R implementation for attempting to
> get a reasonable row group size based on the shape of the data
> (basically, aims to have row groups that have 250 Million cells in
> them). [3]
>
> [1] https://issues.apache.org/jira/browse/ARROW-4542
> [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> https://issues.apache.org/jira/browse/ARROW-14427
> [3]
> https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
>
> -Jon
>
> On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> > In addition, would it be useful to be able to change this
> max_row_group_length
> > from Python?
> > Currently that writer property can't be changed from Python, you can only
> > specify the row_group_size (chunk_size in C++)
> > when writing a table, but that's currently only useful to set it to
> > something that is smaller than the max_row_group_length.
> >
> > Joris
>

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Jonathan Keane <jk...@gmail.com>.
This doesn't address the large number of row groups ticket that was
raised, but for some visibility: there is some work to change the row
group sizing based on the size of data instead of a static number of
rows [1] as well as exposing a few more knobs to tune [2]

There is a bit of prior art in the R implementation for attempting to
get a reasonable row group size based on the shape of the data
(basically, aims to have row groups that have 250 Million cells in
them). [3]

[1] https://issues.apache.org/jira/browse/ARROW-4542
[2] https://issues.apache.org/jira/browse/ARROW-14426 and
https://issues.apache.org/jira/browse/ARROW-14427
[3] https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
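
A rough Python transcription of the heuristic in [3], for anyone who
prefers to see it that way (sketch only; the 250 million cell target
comes from the R code, everything else here is illustrative):

    import pyarrow.parquet as pq

    def write_with_cell_target(table, path, cells_per_group=250_000_000):
        # aim for roughly cells_per_group values per row group,
        # regardless of how wide the table is
        rows_per_group = max(1, cells_per_group // max(1, table.num_columns))
        pq.write_table(table, path, row_group_size=rows_per_group)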

-Jon

On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> In addition, would it be useful to be able to change this max_row_group_length
> from Python?
> Currently that writer property can't be changed from Python, you can only
> specify the row_group_size (chunk_size in C++)
> when writing a table, but that's currently only useful to set it to
> something that is smaller than the max_row_group_length.
>
> Joris

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Joris Van den Bossche <jo...@gmail.com>.
In addition, would it be useful to be able to change this max_row_group_length
from Python?
Currently that writer property can't be changed from Python, you can only
specify the row_group_size (chunk_size in C++)
when writing a table, but that's currently only useful to set it to
something that is smaller than the max_row_group_length.
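>
> For illustration, this is the knob that is reachable today (sketch;
> file and column names are arbitrary):
>
>     import numpy as np
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.table({"x": np.arange(1_000_000)})
>     # row_group_size caps each row group, but only below the default
>     # max_row_group_length of 64Mi rows
>     pq.write_table(table, "example.parquet", row_group_size=100_000)
>     print(pq.ParquetFile("example.parquet").metadata.num_row_groups)  # 10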

Joris

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Sarah Gilmore <sg...@mathworks.com>.
Hi Micah,

Thanks for clarifying! I just created this<https://issues.apache.org/jira/browse/ARROW-14723> JIRA issue to track the issue with pyarrow.

Thanks again!

Sarah
________________________________
From: Micah Kornfield <em...@gmail.com>
Sent: Monday, November 15, 2021 3:34 PM
To: dev <de...@arrow.apache.org>
Subject: Re: [Parquet][C++][Python] Maximum Row Group Length Default

>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.


I don't think we currently have any heuristic around row group size and the
row count (we probably should try adding one). Even the default seems
pretty high, since in general Parquet files are going to have more than one
column per row group.


> I experimented with setting the default maximum row group length to larger
> values and noticed pyarrow cannot import Parquet files containing row
> groups whose lengths exceed 2147483647 rows (int32 max). However, I was
> able to read these files in using the C++ Arrow bindings.

This is surprising, and without seeing the exact error it sounds like a
bug. Could you open a JIRA to discuss (or check if there is already one
tracking this)?


On Mon, Nov 15, 2021 at 12:23 PM Sarah Gilmore <sg...@mathworks.com>
wrote:

> Hi all,
>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.
>
> I was wondering if there was a particular reason why 67108864 was chosen
> as the maximum row group length. I experimented with setting the default
> maximum row group length to larger values and noticed pyarrow cannot import
> Parquet files containing row groups whose lengths exceed 2147483647 rows
> (int32 max). However, I was able to read these files in using the C++ Arrow
> bindings.
>
>
> Best,
> Sarah
>
>
>

Re: [Parquet][C++][Python] Maximum Row Group Length Default

Posted by Micah Kornfield <em...@gmail.com>.
>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.


I don't think we currently have any heuristic around row group size and the
row count (we probably should try adding one).  Even the default seems
pretty high, since in general Parquet files are going to have more than one
column per row group.
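
Just the arithmetic behind that, for concreteness (raw float64 bytes
per row group at the default length, ignoring encoding and compression):

    rows = 67_108_864
    for n_cols in (1, 10, 100):
        gb = rows * n_cols * 8 / 1e9
        print(f"{n_cols:>4} columns -> ~{gb:.1f} GB of raw values per row group")
    #    1 columns -> ~0.5 GB
    #   10 columns -> ~5.4 GB
    #  100 columns -> ~53.7 GB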


> I experimented with setting the default maximum row group length to larger
> values and noticed pyarrow cannot import Parquet files containing row
> groups whose lengths exceed 2147483647 rows (int32 max). However, I was
> able to read these files in using the C++ Arrow bindings.

This is surprising, and without seeing the exact error it sounds like a
bug.  Could you open a JIRA to discuss (or check if there is already one
tracking this)?


On Mon, Nov 15, 2021 at 12:23 PM Sarah Gilmore <sg...@mathworks.com>
wrote:

> Hi all,
>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.
>
> I was wondering if there was a particular reason why 67108864 was chosen
> as the maximum row group length. I experimented with setting the default
> maximum row group length to larger values and noticed pyarrow cannot import
> Parquet files containing row groups whose lengths exceed 2147483647 rows
> (int32 max). However, I was able to read these files in using the C++ Arrow
> bindings.
>
>
> Best,
> Sarah
>
>
>