Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2018/07/23 15:59:00 UTC

[jira] [Assigned] (PARQUET-1337) Implement better estimate of page size for RLE+bitpacking

     [ https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi reassigned PARQUET-1337:
--------------------------------------

    Assignee: Zoltan Ivanfi

> Implement better estimate of page size for RLE+bitpacking
> ---------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>
> If there are many columns with RLE+bitpacking encoding (e.g. dictionary encoding) where the value variance is low, the estimated size of the open pages (which are not encoded yet) is much larger than the final page size. As a result, parquet-mr fails to create row groups whose size is close to {{parquet.block.size}}, which causes performance issues while reading.
> A hint from Ryan to solve this issue:
> {quote}
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the final
> size. We'd probably want to use some overhead value for the header. And,
> we'd want to separate the amount of buffered data from our row group size
> estimate, which are currently the same thing.
> {quote}
> (So, it is not only about RLE+bitpacking but any kind of encoding which is done only after "closing" a page.)
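
A minimal sketch of the estimate described in the hint above, assuming the writer tracks, per column, the raw and final sizes of the pages it has already closed. All names here (ColumnState, PAGE_HEADER_OVERHEAD, estimate_column_size, buffered_bytes) are hypothetical and are not part of parquet-mr; the header overhead value is an arbitrary placeholder.

```python
from dataclasses import dataclass

# Hypothetical per-page header overhead in bytes (placeholder, not a parquet-mr constant).
PAGE_HEADER_OVERHEAD = 64


@dataclass
class ColumnState:
    closed_raw_bytes: int    # raw size of the values in already closed pages
    closed_final_bytes: int  # size of those pages after encoding and compression
    open_raw_bytes: int      # raw size of the values buffered in the open page


def estimate_column_size(col: ColumnState) -> int:
    """Estimate the final size of a column chunk, including the open page."""
    if col.open_raw_bytes == 0:
        return col.closed_final_bytes
    if col.closed_raw_bytes > 0:
        # Scale the open page by the ratio observed on the closed pages.
        # Integer arithmetic keeps the result exact.
        open_estimate = (col.open_raw_bytes * col.closed_final_bytes
                         // col.closed_raw_bytes)
    else:
        # No closed pages yet: fall back to the raw size.
        open_estimate = col.open_raw_bytes
    return col.closed_final_bytes + open_estimate + PAGE_HEADER_OVERHEAD


def buffered_bytes(col: ColumnState) -> int:
    """Memory currently held, tracked separately from the size estimate."""
    return col.closed_final_bytes + col.open_raw_bytes


col = ColumnState(closed_raw_bytes=50_000, closed_final_bytes=5_000,
                  open_raw_bytes=10_000)
print(estimate_column_size(col))  # estimate used for the row group size check
print(buffered_bytes(col))        # amount of data actually buffered
```

Keeping the two functions apart mirrors the last sentence of the hint: the row group size check would use the estimate, while memory accounting would keep using the buffered amount.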



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)