Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2019/10/28 09:11:00 UTC

[jira] [Commented] (PARQUET-1685) Truncate the stored min and max for String statistics to reduce the footer size

    [ https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960878#comment-16960878 ] 

Gabor Szadovszky commented on PARQUET-1685:
-------------------------------------------

We implemented a similar feature for column indexes. We were able to do that because the [specification|https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach] allows it.
Unfortunately, we did not say anything like that for the min/max values in the footer. This means an implementation might rely on the min/max values being actual values in the related page/rowgroup and might implement some logic accordingly.
I am not sure whether truncating the values would cause any trouble in the Parquet implementations, but it is worth thinking about and might require some discussion on the dev list.
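
To make the concern concrete, here is a minimal sketch of how truncated bounds stop being actual values while remaining valid bounds. It is not the parquet-mr or Iceberg code: the function names are hypothetical, and it assumes unsigned lexicographic byte ordering and works on raw bytes, ignoring UTF-8 character boundaries (which a real implementation for strings would have to respect).

```python
from typing import Optional


def truncate_min(value: bytes, length: int) -> bytes:
    """Return a lower bound of at most `length` bytes.

    Any prefix of a byte string sorts less than or equal to the
    full string, so plain truncation is enough for the minimum.
    """
    return value[:length]


def truncate_max(value: bytes, length: int) -> Optional[bytes]:
    """Return an upper bound of at most `length` bytes, or None.

    A plain prefix would sort *below* the real maximum, so the last
    byte of the prefix is incremented. Trailing 0xFF bytes cannot be
    incremented and are dropped first. If nothing is left, no shorter
    upper bound exists and None is returned.
    """
    if len(value) <= length:
        return value  # already short enough, keep the real value
    prefix = bytearray(value[:length])
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None
    prefix[-1] += 1
    return bytes(prefix)


if __name__ == "__main__":
    real = b"parquet-format"
    lo, hi = truncate_min(real, 4), truncate_max(real, 4)
    # Both bounds hold, but neither is a value that occurs in the data.
    print(lo, hi, lo <= real <= hi)
```

A reader that only uses min/max to skip row groups still works with such bounds. A reader that assumes the stored min or max actually occurs in the data (for example, to answer a MIN/MAX query from the footer alone) would return a wrong result.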

BTW, parquet-mr currently implements a 4k hard limit for statistics, so empty Statistics objects will be written to the footer if the min value + max value exceeds this limit. Moreover, after 1.11.0 we will not write statistics into the page headers, so we are only talking about one Statistics object per rowgroup. Is it really worth adding the truncation for the additional 4k (at maximum) per rowgroup?
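
The size limit described above can be sketched roughly as follows. This is only an illustration of the behavior as stated in this comment, not the parquet-mr code: the constant name, the function name, and the reading of "4k" as 4096 bytes are all assumptions.

```python
from typing import Optional, Tuple

# Assumption: "4k" is taken to mean 4096 bytes.
MAX_STATS_SIZE = 4096


def stats_for_footer(
    min_value: bytes, max_value: bytes, limit: int = MAX_STATS_SIZE
) -> Optional[Tuple[bytes, bytes]]:
    """Return (min, max) if their combined size fits within the limit.

    Returns None to represent an empty Statistics object when the
    combined size of min and max exceeds the limit.
    """
    if len(min_value) + len(max_value) > limit:
        return None
    return (min_value, max_value)


if __name__ == "__main__":
    print(stats_for_footer(b"aaa", b"zzz"))                # kept as-is
    print(stats_for_footer(b"a" * 3000, b"z" * 3000))      # dropped
```

Under this behavior, the min and max together never contribute more than the limit, which is where the "additional 4k (at maximum)" figure comes from.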

> Truncate the stored min and max for String statistics to reduce the footer size 
> --------------------------------------------------------------------------------
>
>                 Key: PARQUET-1685
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1685
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics to minimize the metadata size. We can borrow it to truncate them in Parquet as well, to reduce the size of the footer or even the page header. Here is the code in Iceberg: [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)