Posted to dev@parquet.apache.org by Gabor Szadovszky <ga...@apache.org> on 2018/06/21 08:17:20 UTC

Estimated row-group size is significantly higher than the written one

Hi All,

One of our customers faced the following issue: parquet.block.size is
configured to 128M (parquet.writer.max-padding is left at the default
8M). On average, 7 row-groups are generated per block with sizes of
~74M, ~16M, ~12M, ~9M, ~7M, ~5M and ~4M. Increasing the padding to e.g. 60M
results in only one row-group per block, but that wastes disk space.
Investigating the logs, it turns out that parquet-mr thinks the row-group
is already close to 128M, so it flushes the first one, then realizes there
is still space left before reaching the block size, and so on:
INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
...
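(For reference, a minimal sketch of how the two properties above are set via
the Hadoop Configuration API; the property names are the ones from this mail,
the surrounding job setup is hypothetical:)

    import org.apache.hadoop.conf.Configuration;

    public class WriterConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Target row-group size: 128M, matching the HDFS block size here.
        conf.setLong("parquet.block.size", 128L * 1024 * 1024);
        // Maximum padding used to align row groups to block boundaries;
        // this is the default 8M. Raising it to ~60M forces one row group
        // per block but wastes the padded space.
        conf.setInt("parquet.writer.max-padding", 8 * 1024 * 1024);
      }
    }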

My idea about the root cause is that there are many dictionary-encoded
columns where the value variance is low. When we approximate the
row-group size, there are pages which are still open (not encoded yet). For
these dictionary-encoded pages we calculate with 4-byte values for the
dictionary indexes, but if the variance is low, RLE and bit-packing will
decrease the size of these pages dramatically.
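To illustrate the gap (a hypothetical back-of-the-envelope calculation, not
the actual writer code): for an open page holding 1,000,000 values drawn from
a dictionary of only 16 entries, counting 4 bytes per index gives ~4MB, while
bit-packing alone brings it down to ~500KB, before RLE shrinks the runs further:

    public class DictionaryEstimateSketch {
      public static void main(String[] args) {
        long values = 1_000_000L;   // values buffered in the still-open page
        int dictionarySize = 16;    // low variance: only 16 distinct values

        // Current approximation: every dictionary index counted as 4 bytes.
        long naiveBytes = values * 4;

        // After encoding, indexes are bit-packed to ceil(log2(dictionarySize))
        // bits, and RLE can shrink long runs of identical values even further.
        int bitWidth = 32 - Integer.numberOfLeadingZeros(dictionarySize - 1);
        long bitPackedBytes = (values * bitWidth + 7) / 8;

        System.out.printf("naive estimate: %,d bytes%n", naiveBytes);             // 4,000,000
        System.out.printf("bit-packed upper bound: %,d bytes%n", bitPackedBytes); // 500,000
      }
    }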

What do you guys think? Are we able to make the approximation a bit better?
Do we have some properties that can solve this issue?

Thanks a lot,
Gabor

Re: Estimated row-group size is significantly higher than the written one

Posted by Gabor Szadovszky <ga...@apache.org>.
Thanks a lot, Ryan. Created the JIRA PARQUET-1337 to track it.


Re: Estimated row-group size is significantly higher than the written one

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I think you're right about the cause. The current estimate is what is
buffered in memory, so it includes all of the intermediate data for the
last page before it is finalized and compressed.

We could probably get a better estimate by using the amount of buffered
data and how large other pages in a column were after fully encoding and
compressing. So if you have 5 pages compressed and buffered, and another
1000 values, use the compression ratio of the 5 pages to estimate the final
size. We'd probably want to use some overhead value for the header. And,
we'd want to separate the amount of buffered data from our row group size
estimate, which are currently the same thing.
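A rough sketch of what such an estimator could look like (hypothetical names,
not the actual parquet-mr code): track the encoded+compressed size of the
pages already closed in a column, derive a ratio against their raw size, and
apply that ratio to the still-open buffer plus some per-page header overhead:

    // Hypothetical estimator; assumes the column writer can report how many
    // bytes of finalized (encoded + compressed) pages and raw buffered data it holds.
    public class RowGroupSizeEstimator {
      private static final long ASSUMED_PAGE_HEADER_OVERHEAD = 64; // bytes, a guess

      /**
       * @param finalizedCompressedBytes size of pages already encoded and compressed
       * @param finalizedRawBytes        raw in-memory size those pages had before encoding
       * @param openBufferRawBytes       raw size of the data still buffered in the open page
       */
      static long estimateColumnSize(long finalizedCompressedBytes,
                                     long finalizedRawBytes,
                                     long openBufferRawBytes) {
        // Until at least one page is closed there is no ratio to work with,
        // so fall back to the raw buffered size (the current behaviour).
        if (finalizedRawBytes == 0) {
          return finalizedCompressedBytes + openBufferRawBytes;
        }
        double ratio = (double) finalizedCompressedBytes / finalizedRawBytes;
        long projectedOpenPage =
            (long) (openBufferRawBytes * ratio) + ASSUMED_PAGE_HEADER_OVERHEAD;
        return finalizedCompressedBytes + projectedOpenPage;
      }
    }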

rb



-- 
Ryan Blue
Software Engineer
Netflix