Posted to user@arrow.apache.org by Shawn Zeng <xz...@gmail.com> on 2022/02/22 12:40:54 UTC

[Python][Parquet]Why the default row_group_size is 64M?

Hi,

The default row_group_size is really large, which means that even a fairly
large table with fewer than 64M rows will not get the benefits of
row-group-level statistics. What is the reason for this? Do you plan to
change the default?

Thanks,
Shawn

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Micah Kornfield <em...@gmail.com>.
Based on this, aligning the default for the non-dataset write path to 1
million rows seems to make sense in the short term.
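
In the meantime, a minimal sketch of opting into ~1M-row row groups
explicitly via pyarrow.parquet.write_table (file name and table contents
are illustrative, not from this thread):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative table; replace with your own data.
    n = 3_000_000
    table = pa.table({"id": pa.array(range(n)), "value": pa.array(range(n))})

    # Cap row groups at ~1M rows instead of relying on the 64M-row default,
    # so row-group statistics stay fine-grained enough to be useful.
    pq.write_table(table, "example.parquet", row_group_size=1_000_000)

    # The file now contains 3 row groups of 1M rows each.
    print(pq.ParquetFile("example.parquet").metadata.num_row_groups)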

On Tuesday, February 22, 2022, Shawn Zeng <xz...@gmail.com> wrote:

> Thanks for Weston's clear explanation. Point 2 is what I've experienced
> without tuning the parameter, and point 3 is what I was concerned about.
> Looking forward to finer-grained reading/indexing of Parquet than the
> row-group level, which should fix the issue.
>
> Weston Pace <we...@gmail.com> 于2022年2月23日周三 13:30写道:
>
>> These are all great points.  A few notes from my own experiments
>> (mostly confirming what others have said):
>>
>>  1) 1M rows is the minimum safe size for row groups on HDD (and
>> perhaps a hair too low in some situations) if you are doing any kind
>> of column selection (i.e. projection pushdown).  As that number gets
>> lower the ratio of skips to reads increases to the point where it
>> starts to look too much like "random read" for the HDD and performance
>> suffers.
>>  2) 64M rows is too high for two (preventable) reasons in the C++
>> datasets API.  The first reason is that the datasets API does not
>> currently support sub-row-group streaming reads (i.e. we can only read
>> one row group at a time from the disk).  Large row groups lead to too
>> much initial latency (have to read an entire row group before we start
>> processing) and too much RAM usage.
>>  3) 64M rows is also typically too high for the C++ datasets API
>> because, as Micah pointed out, we don't yet have support in the
>> datasets API for page-level column indices.  This means that
>> statistics-based filtering is done at the row-group level and very
>> coarse grained.  The bigger the block the less likely it is that a
>> filter will eclipse the entire block.
>>
>> Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
>> found reasonable performance with 1M rows per row group and so I
>> haven't personally been as highly motivated to fix the latter two
>> issues but they are somewhat high up on my personal priority list.  If
>> anyone has time to devote to working on these issues I would be happy
>> to help someone get started.  Ideally, if we can fix those two issues,
>> then something like what Micah described (one row group per file) is
>> fine and we can help shield users from frustrating parameter tuning.
>>
>> I have a draft of a partial fix for 2 at [1][2].  I expect I should be
>> able to get back to it before the 8.0.0 release.  I couldn't find an
>> issue for the more complete fix (scanning at page-resolution instead
>> of row-group-resolution) so I created [3].
>>
>> A good read for the third point is at [4].  I couldn't find a JIRA
>> issue for this from a quick search but I feel that we probably have
>> some issues somewhere.
>>
>> [1] https://github.com/apache/arrow/pull/12228
>> [2] https://issues.apache.org/jira/browse/ARROW-14510
>> [3] https://issues.apache.org/jira/browse/ARROW-15759
>> [4] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>>
>> On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <xz...@gmail.com> wrote:
>> >
>> > Hi, thank you for your reply. The confusion comes from what Micah
>> mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that
>> case it does not align with the Hadoop block size unless you have only 1
>> byte per row, so in most cases the row group will be much larger than
>> 64MB. I think the fact that this parameter is a number of rows rather
>> than a size already caused confusion when I read the docs and changed the
>> parameter.
>> >
>> > I don't clearly understand why the current thinking is to have 1
>> row-group per file. Could you explain more?
>> >
>> > Micah Kornfield <em...@gmail.com> 于2022年2月23日周三 03:52写道:
>> >>>
>> >>> What is the reason for this? Do you plan to change the default?
>> >>
>> >>
>> >> I think there is some confusion. I do believe this is a number of
>> rows, but I'd guess it was set to 64M because it wasn't carefully adapted
>> from parquet-mr, which I would guess uses a byte size and therefore
>> aligns well with the HDFS block size.
>> >>
>> >> I don't recall seeing any open issues to change it.  It looks like for
>> datasets [1] the default is 1 million, so maybe we should try to align
>> these two.  I don't have a strong opinion here, but my impression is that
>> the current thinking is to generally have 1 row-group per file, and
>> eliminate entire files.  For sub-file pruning, I think column indexes are a
>> better solution but they have not been implemented in pyarrow yet.
>> >>
>> >> [1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>> >>
>> >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
>> marnix.van.den.broek@bundlesandbatches.io> wrote:
>> >>>
>> >>> hi Shawn,
>> >>>
>> >>> I expect this is the default because Parquet comes from the Hadoop
>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
>> need a different default? You can set it to the size that fits your use
>> case best, right?
>> >>>
>> >>> Marnix
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:
>> >>>>
>> >>>> For clarification, I am referring to pyarrow.parquet.write_table
>> >>>>
>> >>>> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> The default row_group_size is really large, which means that even a
>> fairly large table with fewer than 64M rows will not get the benefits of
>> row-group-level statistics. What is the reason for this? Do you plan to
>> change the default?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Shawn
>>
>

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Shawn Zeng <xz...@gmail.com>.
Thanks for Weston's clear explanation. Point 2 is what I've experienced
without tuning the parameter, and point 3 is what I was concerned about.
Looking forward to finer-grained reading/indexing of Parquet than the
row-group level, which should fix the issue.

Weston Pace <we...@gmail.com> 于2022年2月23日周三 13:30写道:

> These are all great points.  A few notes from my own experiments
> (mostly confirming what others have said):
>
>  1) 1M rows is the minimum safe size for row groups on HDD (and
> perhaps a hair too low in some situations) if you are doing any kind
> of column selection (i.e. projection pushdown).  As that number gets
> lower the ratio of skips to reads increases to the point where it
> starts to look too much like "random read" for the HDD and performance
> suffers.
>  2) 64M rows is too high for two (preventable) reasons in the C++
> datasets API.  The first reason is that the datasets API does not
> currently support sub-row-group streaming reads (i.e. we can only read
> one row group at a time from the disk).  Large row groups lead to too
> much initial latency (have to read an entire row group before we start
> processing) and too much RAM usage.
>  3) 64M rows is also typically too high for the C++ datasets API
> because, as Micah pointed out, we don't yet have support in the
> datasets API for page-level column indices.  This means that
> statistics-based filtering is done at the row-group level and very
> coarse grained.  The bigger the block the less likely it is that a
> filter will eclipse the entire block.
>
> Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
> found reasonable performance with 1M rows per row group and so I
> haven't personally been as highly motivated to fix the latter two
> issues but they are somewhat high up on my personal priority list.  If
> anyone has time to devote to working on these issues I would be happy
> to help someone get started.  Ideally, if we can fix those two issues,
> then something like what Micah described (one row group per file) is
> fine and we can help shield users from frustrating parameter tuning.
>
> I have a draft of a partial fix for 2 at [1][2].  I expect I should be
> able to get back to it before the 8.0.0 release.  I couldn't find an
> issue for the more complete fix (scanning at page-resolution instead
> of row-group-resolution) so I created [3].
>
> A good read for the third point is at [4].  I couldn't find a JIRA
> issue for this from a quick search but I feel that we probably have
> some issues somewhere.
>
> [1] https://github.com/apache/arrow/pull/12228
> [2] https://issues.apache.org/jira/browse/ARROW-14510
> [3] https://issues.apache.org/jira/browse/ARROW-15759
> [4]
> https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>
> On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <xz...@gmail.com> wrote:
> >
> > Hi, thank you for your reply. The confusion comes from what Micah
> mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that
> case it does not align with the Hadoop block size unless you have only 1
> byte per row, so in most cases the row group will be much larger than
> 64MB. I think the fact that this parameter is a number of rows rather
> than a size already caused confusion when I read the docs and changed the
> parameter.
> >
> > I don't clearly understand why the current thinking is to have 1
> row-group per file. Could you explain more?
> >
> > Micah Kornfield <em...@gmail.com> 于2022年2月23日周三 03:52写道:
> >>>
> >>> What is the reason for this? Do you plan to change the default?
> >>
> >>
> >> I think there is some confusion. I do believe this is a number of
> rows, but I'd guess it was set to 64M because it wasn't carefully adapted
> from parquet-mr, which I would guess uses a byte size and therefore
> aligns well with the HDFS block size.
> >>
> >> I don't recall seeing any open issues to change it.  It looks like for
> datasets [1] the default is 1 million, so maybe we should try to align
> these two.  I don't have a strong opinion here, but my impression is that
> the current thinking is to generally have 1 row-group per file, and
> eliminate entire files.  For sub-file pruning, I think column indexes are a
> better solution but they have not been implemented in pyarrow yet.
> >>
> >> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
> >>
> >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
> marnix.van.den.broek@bundlesandbatches.io> wrote:
> >>>
> >>> hi Shawn,
> >>>
> >>> I expect this is the default because Parquet comes from the Hadoop
> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
> need a different default? You can set it to the size that fits your use
> case best, right?
> >>>
> >>> Marnix
> >>>
> >>>
> >>>
> >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:
> >>>>
> >>>> For clarification, I am referring to pyarrow.parquet.write_table
> >>>>
> >>>> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> The default row_group_size is really large, which means that even a
> fairly large table with fewer than 64M rows will not get the benefits of
> row-group-level statistics. What is the reason for this? Do you plan to
> change the default?
> >>>>>
> >>>>> Thanks,
> >>>>> Shawn
>

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Weston Pace <we...@gmail.com>.
These are all great points.  A few notes from my own experiments
(mostly confirming what others have said):

 1) 1M rows is the minimum safe size for row groups on HDD (and
perhaps a hair too low in some situations) if you are doing any kind
of column selection (i.e. projection pushdown).  As that number gets
lower the ratio of skips to reads increases to the point where it
starts to look too much like "random read" for the HDD and performance
suffers.
 2) 64M rows is too high for two (preventable) reasons in the C++
datasets API.  The first reason is that the datasets API does not
currently support sub-row-group streaming reads (i.e. we can only read
one row group at a time from the disk).  Large row groups lead to too
much initial latency (have to read an entire row group before we start
processing) and too much RAM usage.
 3) 64M rows is also typically too high for the C++ datasets API
because, as Micah pointed out, we don't yet have support in the
datasets API for page-level column indices.  This means that
statistics-based filtering is done at the row-group level and very
coarse grained.  The bigger the block the less likely it is that a
filter will eclipse the entire block.

Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
found reasonable performance with 1M rows per row group and so I
haven't personally been as highly motivated to fix the latter two
issues but they are somewhat high up on my personal priority list.  If
anyone has time to devote to working on these issues I would be happy
to help someone get started.  Ideally, if we can fix those two issues,
then something like what Micah described (one row group per file) is
fine and we can help shield users from frustrating parameter tuning.
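
As a rough, hedged sketch of points 1 and 3 in practice (file name, column
names, and sizes are illustrative, not from this thread): with ~1M-row row
groups, projection pushdown reads only the requested column chunks, and the
filter can skip whole row groups based on their statistics.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    n = 4_000_000
    table = pa.table({"id": pa.array(range(n)), "value": pa.array(range(n))})
    pq.write_table(table, "example.parquet", row_group_size=1_000_000)

    dataset = ds.dataset("example.parquet", format="parquet")
    # Projection pushdown: only the "value" column chunks are read.
    # Row-group pruning: groups whose "id" statistics fall entirely outside
    # the predicate can be skipped without being decoded.
    result = dataset.to_table(columns=["value"],
                              filter=ds.field("id") >= 3_000_000)
    print(result.num_rows)  # 1000000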

I have a draft of a partial fix for 2 at [1][2].  I expect I should be
able to get back to it before the 8.0.0 release.  I couldn't find an
issue for the more complete fix (scanning at page-resolution instead
of row-group-resolution) so I created [3].

A good read for the third point is at [4].  I couldn't find a JIRA
issue for this from a quick search but I feel that we probably have
some issues somewhere.
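
Continuing the sketch above, the per-row-group statistics that this
coarse-grained filtering relies on can be inspected directly (again just a
sketch, reusing the illustrative example.parquet file):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("example.parquet").metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        stats = rg.column(0).statistics  # statistics for the first column
        if stats is not None and stats.has_min_max:
            print(i, rg.num_rows, stats.min, stats.max)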

[1] https://github.com/apache/arrow/pull/12228
[2] https://issues.apache.org/jira/browse/ARROW-14510
[3] https://issues.apache.org/jira/browse/ARROW-15759
[4] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <xz...@gmail.com> wrote:
>
> Hi, thank you for your reply. The confusion comes from what Micah mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that case it does not align with the Hadoop block size unless you have only 1 byte per row, so in most cases the row group will be much larger than 64MB. I think the fact that this parameter is a number of rows rather than a size already caused confusion when I read the docs and changed the parameter.
>
> I don't clearly understand why the current thinking is to have 1 row-group per file. Could you explain more?
>
> Micah Kornfield <em...@gmail.com> 于2022年2月23日周三 03:52写道:
>>>
>>> What is the reason for this? Do you plan to change the default?
>>
>>
>> I think there is some confusion. I do believe this is a number of rows, but I'd guess it was set to 64M because it wasn't carefully adapted from parquet-mr, which I would guess uses a byte size and therefore aligns well with the HDFS block size.
>>
>> I don't recall seeing any open issues to change it.  It looks like for datasets [1] the default is 1 million, so maybe we should try to align these two.  I don't have a strong opinion here, but my impression is that the current thinking is to generally have 1 row-group per file, and eliminate entire files.  For sub-file pruning, I think column indexes are a better solution but they have not been implemented in pyarrow yet.
>>
>> [1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>>
>> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <ma...@bundlesandbatches.io> wrote:
>>>
>>> hi Shawn,
>>>
>>> I expect this is the default because Parquet comes from the Hadoop ecosystem, and the Hadoop block size is usually set to 64MB. Why would you need a different default? You can set it to the size that fits your use case best, right?
>>>
>>> Marnix
>>>
>>>
>>>
>>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:
>>>>
>>>> For clarification, I am referring to pyarrow.parquet.write_table
>>>>
>>>> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The default row_group_size is really large, which means that even a fairly large table with fewer than 64M rows will not get the benefits of row-group-level statistics. What is the reason for this? Do you plan to change the default?
>>>>>
>>>>> Thanks,
>>>>> Shawn

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Shawn Zeng <xz...@gmail.com>.
Hi, thank you for your reply. The confusion comes from what Micah mentions:
the row_group_size in pyarrow is 64M rows, not 64MB. In that case it does
not align with the Hadoop block size unless you have only 1 byte per row,
so in most cases the row group will be much larger than 64MB. I think the
fact that this parameter is a number of rows rather than a size already
caused confusion when I read the docs and changed the parameter.
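
A rough back-of-the-envelope sketch of the mismatch (the 50-byte average row
width is an assumption for illustration, not a figure from this thread):

    # pyarrow's default row_group_size, expressed in rows
    rows_per_group = 64 * 1024 * 1024
    bytes_per_row = 50  # assumed average encoded row width
    gib = rows_per_group * bytes_per_row / 2**30
    print(f"~{gib:.1f} GiB per row group")  # ~3.1 GiB, vs. a 64 MB HDFS block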

I don't clearly understand why the current thinking is to have 1 row-group
per file. Could you explain more?

Micah Kornfield <em...@gmail.com> 于2022年2月23日周三 03:52写道:

> What is the reason for this? Do you plan to change the default?
>
>
> I think there is some confusion, I do believe this is the number of rows
> but I'd guess it was set to 64M because it wasn't carefully adapted from
> parquet-mr which I would guess uses byte size and therefore it aligns well
> with the HDFS block size.
>
> I don't recall seeing any open issues to change it.  It looks like for
> datasets [1] the default is 1 million, so maybe we should try to align
> these two.  I don't have a strong opinion here, but my impression is that
> the current thinking is to generally have 1 row-group per file, and
> eliminate entire files.  For sub-file pruning, I think column indexes are a
> better solution but they have not been implemented in pyarrow yet.
>
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>
> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
> marnix.van.den.broek@bundlesandbatches.io> wrote:
>
>> hi Shawn,
>>
>> I expect this is the default because Parquet comes from the Hadoop
>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
>> need a different default? You can set it to the size that fits your use
>> case best, right?
>>
>> Marnix
>>
>>
>>
>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:
>>
>>> For clarification, I am referring to pyarrow.parquet.write_table
>>>
>>> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
>>>
>>>> Hi,
>>>>
>>>> The default row_group_size is really large, which means that even a
>>>> fairly large table with fewer than 64M rows will not get the benefits of
>>>> row-group-level statistics. What is the reason for this? Do you plan to
>>>> change the default?
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>>

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Micah Kornfield <em...@gmail.com>.
>
> What is the reason for this? Do you plan to change the default?


I think there is some confusion. I do believe this is a number of rows, but
I'd guess it was set to 64M because it wasn't carefully adapted from
parquet-mr, which I would guess uses a byte size and therefore aligns well
with the HDFS block size.

I don't recall seeing any open issues to change it.  It looks like for
datasets [1] the default is 1 million, so maybe we should try to align
these two.  I don't have a strong opinion here, but my impression is that
the current thinking is to generally have 1 row-group per file, and
eliminate entire files.  For sub-file pruning, I think column indexes are a
better solution but they have not been implemented in pyarrow yet.
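
A short sketch of the two write paths being compared (illustrative only;
max_rows_per_group is the keyword documented for write_dataset in current
pyarrow, so treat its availability in your version as an assumption):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    table = pa.table({"id": pa.array(range(2_000_000))})

    # Non-dataset path: row_group_size is a row count (64M rows by default).
    pq.write_table(table, "single_file.parquet", row_group_size=1_000_000)

    # Dataset path: row groups are capped at ~1M rows per group by default [1].
    ds.write_dataset(table, "dataset_dir", format="parquet",
                     max_rows_per_group=1_000_000)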

[1]
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset

On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
marnix.van.den.broek@bundlesandbatches.io> wrote:

> hi Shawn,
>
> I expect this is the default because Parquet comes from the Hadoop
> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
> need a different default? You can set it to the size that fits your use
> case best, right?
>
> Marnix
>
>
>
> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:
>
>> For clarification, I am referring to pyarrow.parquet.write_table
>>
>> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
>>
>>> Hi,
>>>
>>> The default row_group_size is really large, which means that even a
>>> fairly large table with fewer than 64M rows will not get the benefits of
>>> row-group-level statistics. What is the reason for this? Do you plan to
>>> change the default?
>>>
>>> Thanks,
>>> Shawn
>>>
>>

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Marnix van den Broek <ma...@bundlesandbatches.io>.
hi Shawn,

I expect this is the default because Parquet comes from the Hadoop
ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
need a different default? You can set it to the size that fits your use
case best, right?

Marnix



On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <xz...@gmail.com> wrote:

> For clarification, I am referring to pyarrow.parquet.write_table
>
> Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:
>
>> Hi,
>>
>> The default row_group_size is really large, which means that even a
>> fairly large table with fewer than 64M rows will not get the benefits of
>> row-group-level statistics. What is the reason for this? Do you plan to
>> change the default?
>>
>> Thanks,
>> Shawn
>>
>

Re: [Python][Parquet]Why the default row_group_size is 64M?

Posted by Shawn Zeng <xz...@gmail.com>.
For clarification, I am referring to pyarrow.parquet.write_table

Shawn Zeng <xz...@gmail.com> 于2022年2月22日周二 20:40写道:

> Hi,
>
> The default row_group_size is really large, which means that even a
> fairly large table with fewer than 64M rows will not get the benefits of
> row-group-level statistics. What is the reason for this? Do you plan to
> change the default?
>
> Thanks,
> Shawn
>