Posted to dev@parquet.apache.org by ALeX Wang <ee...@gmail.com> on 2018/01/12 18:43:02 UTC

Recommended rowgroup size, and number of row groups for large table

Hi,

I'm using Parquet to store a big table (400+ columns), and most of the
columns will be null (None).

Is there a recommended row group size and number of row groups per
Parquet file for my use case? Or is there a reference or paper that I could
read myself?


Thanks,
-- 
Alex Wang,
Open vSwitch developer

Re: Recommended rowgroup size, and number of row groups for large table

Posted by Zoltan Ivanfi <zi...@cloudera.com>.
Hi,

If you use HDFS, the row group size should match the HDFS block size;
otherwise data locality (and thus performance) will suffer.
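
To illustrate why the sizes should match, here is a minimal sketch in plain Python. It is a toy model, not part of any Parquet or HDFS API: the function name is hypothetical, and it assumes row groups are laid out back to back from offset 0 with no header, footer, or padding. It counts how many row groups cross a block boundary, which is when a reader on one node may have to fetch part of a row group from another node.

```python
MiB = 1024 * 1024


def straddling_row_groups(row_group_bytes, block_bytes, num_row_groups):
    """Count row groups that cross an HDFS block boundary.

    Simplifying assumption: row groups are written back to back from
    offset 0, with no padding, header, or footer.
    """
    straddling = 0
    for i in range(num_row_groups):
        start = i * row_group_bytes
        end = start + row_group_bytes - 1  # last byte of this row group
        if start // block_bytes != end // block_bytes:
            straddling += 1
    return straddling


# Row groups equal to the block size never cross a boundary in this model.
aligned = straddling_row_groups(128 * MiB, 128 * MiB, 8)
# Row groups of 100 MiB in 128 MiB blocks mostly do.
misaligned = straddling_row_groups(100 * MiB, 128 * MiB, 8)
print(aligned, misaligned)
```

With these toy numbers, none of the eight aligned row groups straddles a block, while six of the eight 100 MiB row groups do.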

Regarding page size, in general larger pages lead to smaller files. On the
other hand, the page-level metadata may include min and max values that can
be used for skipping entire pages when looking for specific values which do
not fall in their min-max range. With larger pages, this possibility to
skip pages becomes less fine-grained, so in the end more data may have to
be deserialized.
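
The trade-off can be sketched with a toy model in plain Python. This is not how a Parquet reader is implemented; the helper names are hypothetical, and the data is sorted, which is the best case for min/max skipping. It only shows that the number of values that must be looked at for a point lookup grows with the page size.

```python
def build_pages(values, page_size):
    """Split values into fixed-size pages and record min/max per page."""
    pages = []
    for i in range(0, len(values), page_size):
        chunk = values[i:i + page_size]
        pages.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return pages


def values_scanned(pages, target):
    """Count values in pages whose [min, max] range could contain target."""
    return sum(
        len(p["values"]) for p in pages if p["min"] <= target <= p["max"]
    )


values = list(range(1000))  # sorted data, the best case for skipping
small_pages = build_pages(values, 10)
large_pages = build_pages(values, 250)

# Only the one page whose range covers 123 has to be read in each layout.
print(values_scanned(small_pages, 123))  # 10 values
print(values_scanned(large_pages, 123))  # 250 values
```

On unsorted data the min-max ranges of pages overlap much more, so the benefit of small pages shrinks.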

Zoltan

On Fri, Jan 12, 2018 at 10:19 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> I recommend trying different values using the parquet-cli. That's an easy
> way to see how different row group and page sizes perform. That's what I do
> to tune all of our tables.
>
> rb

Re: Recommended rowgroup size, and number of row groups for large table

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I recommend trying different values using the parquet-cli. That's an easy
way to see how different row group and page sizes perform. That's what I do
to tune all of our tables.
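
For readers who prefer to script the same kind of experiment, here is a rough sketch of a sweep over candidate sizes in plain Python. It is not parquet-cli and does not use any parquet-cli options. The `sweep` function and the `write_file` callback are hypothetical names: the callback stands in for whatever Parquet writer you use, and the harness only records file size and write time for each combination.

```python
import itertools
import os
import time

MiB = 1024 * 1024


def sweep(write_file, row_group_sizes, page_sizes, out_dir):
    """Write one file per (row group size, page size) pair and record results.

    `write_file(path, row_group_bytes, page_bytes)` is a hypothetical
    callback supplied by the caller; it should write the table with the
    Parquet writer of your choice.
    """
    results = []
    for rg, pg in itertools.product(row_group_sizes, page_sizes):
        path = os.path.join(out_dir, f"rg{rg}_pg{pg}.parquet")
        started = time.perf_counter()
        write_file(path, rg, pg)
        elapsed = time.perf_counter() - started
        results.append({
            "row_group_bytes": rg,
            "page_bytes": pg,
            "file_bytes": os.path.getsize(path),
            "write_seconds": elapsed,
        })
    # Smallest file first; sort by another key to compare write time instead.
    return sorted(results, key=lambda r: r["file_bytes"])
```

A call such as `sweep(my_writer, [64 * MiB, 128 * MiB], [MiB, 8 * MiB], "/tmp/out")` would produce four files to compare. Read performance for representative queries would still need to be measured separately.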

rb


-- 
Ryan Blue
Software Engineer
Netflix