Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/10/17 18:21:00 UTC

[jira] [Commented] (PARQUET-1414) Limit page size based on maximum row count

    [ https://issues.apache.org/jira/browse/PARQUET-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653995#comment-16653995 ] 

Ryan Blue commented on PARQUET-1414:
------------------------------------

[~gszadovszky], can you add a link to your benchmarks to this issue?

I think the conclusion we reached in our discussion was a limit between 10k and 20k rows, with 20k being the better choice for overall file size. Is 20k the planned default now?

> Limit page size based on maximum row count
> ------------------------------------------
>
>                 Key: PARQUET-1414
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1414
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.11.0
>
>
> For column-index-based filtering it is important that each column has enough pages. When an encoding matches the data very well, all of a column's values may end up encoded in a single page (e.g. a column holding an ascending counter).
> With this improvement we would be able to limit each page by the maximum number of rows written to it, so that every column has enough pages. A good default value should be benchmarked. Initially, we can use 10k.
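
The effect described in the issue can be illustrated with a small, self-contained sketch. This is not Parquet's writer code: `PageBuffer`, `page_row_counts`, and the toy size estimate (a fixed 16 bytes for a run of constant deltas, 8 bytes per value otherwise) are all hypothetical names and assumptions made up for this example. It only shows why a byte-based limit alone can leave an ascending counter in a single page, and how an additional row-count limit forces more pages.

```python
class PageBuffer:
    """Hypothetical page buffer with a toy size estimate (not Parquet's encoder)."""

    def __init__(self):
        self.rows = 0
        self.last = None
        self.delta = None
        self.constant_delta = True

    def add(self, value):
        # Track whether every consecutive difference in this page is identical.
        if self.last is not None:
            d = value - self.last
            if self.delta is None:
                self.delta = d
            elif d != self.delta:
                self.constant_delta = False
        self.last = value
        self.rows += 1

    def estimated_bytes(self):
        # Assumption: a constant-delta run compresses to a fixed 16 bytes,
        # anything else costs 8 bytes per value.
        return 16 if self.constant_delta else 8 * self.rows


def page_row_counts(values, max_page_bytes, max_page_rows=None):
    """Return the number of rows in each page produced for `values`."""
    counts = []
    page = PageBuffer()
    for v in values:
        page.add(v)
        over_bytes = page.estimated_bytes() >= max_page_bytes
        over_rows = max_page_rows is not None and page.rows >= max_page_rows
        if over_bytes or over_rows:
            # Close the current page and start a new one.
            counts.append(page.rows)
            page = PageBuffer()
    if page.rows:
        counts.append(page.rows)
    return counts


counter = range(100_000)  # an ascending counter column
only_bytes = page_row_counts(counter, max_page_bytes=1 << 20)
with_rows = page_row_counts(counter, max_page_bytes=1 << 20, max_page_rows=20_000)
print(len(only_bytes), len(with_rows))
```

Under these assumptions, the byte limit alone never triggers for the counter column, so all 100,000 rows land in one page, while a 20k row limit splits the same column into five pages of 20,000 rows each.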



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)