Posted to dev@parquet.apache.org by "Xinli Shang (Jira)" <ji...@apache.org> on 2019/10/28 15:36:00 UTC

[jira] [Comment Edited] (PARQUET-1685) Truncate the stored min and max for String statistics to reduce the footer size

    [ https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961165#comment-16961165 ] 

Xinli Shang edited comment on PARQUET-1685 at 10/28/19 3:35 PM:
----------------------------------------------------------------

Hi [~gszadovszky] Thanks for your reply!  

Regarding "an implementation might rely on the fact that the min/max values are actual values": was this already discussed when statistics truncation was implemented for the 'column index'? I would like to add [~rdblue], who may already have discussed and thought this through, because this is implemented in Iceberg.

For the 4k hard limit, I am thinking about it the other way around. If empty statistics are written because the statistics are oversized, queries become inefficient. If truncating reduces the size, and as a result reduces the number of files with empty statistics, then it is a big win.
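
To make the truncation idea concrete, here is a minimal sketch of how a string min/max pair could be shortened while remaining valid bounds. It is not the Iceberg code and not what parquet-mr does; the class name StatTruncationSketch, the method names, and the maxCodePoints parameter are all made up for illustration. It assumes values are ordered by Unicode code point (which matches unsigned UTF-8 byte order) and that maxCodePoints is positive. Note that the truncated max is generally not an actual value in the column, which is exactly the concern quoted above.

```java
final class StatTruncationSketch {
    private StatTruncationSketch() {}

    /** Lower bound: a prefix never sorts after the full value, so it is a safe min. */
    static String truncateMin(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough
        }
        // offsetByCodePoints keeps surrogate pairs intact when cutting.
        return value.substring(0, value.offsetByCodePoints(0, maxCodePoints));
    }

    /**
     * Upper bound: cut to a prefix, then increment the last code point that can
     * be incremented and drop everything after it. The result sorts after every
     * string that starts with the original prefix.
     */
    static String truncateMax(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough
        }
        int[] prefix = value.codePoints().limit(maxCodePoints).toArray();
        for (int i = prefix.length - 1; i >= 0; i--) {
            int next = prefix[i] + 1;
            // Skip the surrogate range, which holds no encodable characters.
            if (Character.MIN_SURROGATE <= next && next <= Character.MAX_SURROGATE) {
                next = Character.MAX_SURROGATE + 1;
            }
            if (next <= Character.MAX_CODE_POINT) {
                prefix[i] = next;
                return new String(prefix, 0, i + 1);
            }
        }
        // Every code point in the prefix was already the maximum: no shorter
        // upper bound exists, so fall back to the untruncated value.
        return value;
    }

    public static void main(String[] args) {
        System.out.println(truncateMin("abcdef", 3));
        System.out.println(truncateMax("abcdef", 3));
    }
}
```

A reader filtering on such statistics can still skip data safely, because the truncated min is never greater than the real min and the truncated max is never smaller than the real max. It just cannot treat either one as a value that actually occurs in the data.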

In 1.11.0+, is using the 'column index' enforced, with page statistics no longer written?

 

 


was (Author: shangx@uber.com):
Hi [~gszadovszky] Thanks for your reply!  

Regarding "an implementation might rely on the fact that the min/max values are actual values": was this already discussed when statistics truncation was implemented for the 'column index'? I would like to add [~rdblue], who may already have discussed and thought this through, because this is implemented in Iceberg.

For the 4k hard limit, I am thinking about it the other way around. If empty statistics are written because the statistics are oversized, queries become inefficient. If truncating reduces the size and reduces the number of files with empty statistics, then it is a big win.

In 1.11.0+, is using the 'column index' enforced, with page statistics no longer written?

 

 

> Truncate the stored min and max for String statistics to reduce the footer size 
> --------------------------------------------------------------------------------
>
>                 Key: PARQUET-1685
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1685
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min and max statistics to minimize the metadata size. We can borrow the idea and truncate them in Parquet as well, to reduce the size of the footer or even the page header. Here is the code in Iceberg: [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)