Posted to dev@parquet.apache.org by "Anthony Pessy (Jira)" <ji...@apache.org> on 2020/09/08 12:59:00 UTC
[jira] [Created] (PARQUET-1911) Add way to disable statistics on a per-column basis
Anthony Pessy created PARQUET-1911:
--------------------------------------
Summary: Add way to disable statistics on a per-column basis
Key: PARQUET-1911
URL: https://issues.apache.org/jira/browse/PARQUET-1911
Project: Parquet
Issue Type: New Feature
Components: parquet-mr
Reporter: Anthony Pessy
When you have a dataset with BINARY columns that can be fairly large (several MB each), you can often end up with an OutOfMemory error, leaving you to either:
- Throw more RAM at the job
- Increase the number of output files
- Play with the block size (see the sketch after this list)
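For illustration, here is a minimal sketch of the block-size workaround, assuming a plain ParquetWriter-based job; the path, schema and 256 MB value are arbitrary illustrations, not values from this issue:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BlockSizeWorkaround {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc { required binary payload; }");

    // Larger row groups mean fewer BlockMetaData entries per file, hence
    // fewer accumulated min/max values, at the price of buffering more
    // data in memory before each flush.
    ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("/tmp/out.parquet"))
        .withType(schema)
        .withRowGroupSize(256 * 1024 * 1024)
        .build();
    writer.close();
  }
}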
Using a fork with an increased row group size check frequency helps, but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])
The OutOfMemory error is then caused by the accumulation of min/max values for those columns in each BlockMetaData.
The "parquet.statistics.truncate.length" configuration is of no help because it is applied during the footer serialization whereas the OOM occurs before that.
I think it would be nice to have, as with dictionary encoding or bloom filters, a way to disable statistics on a per-column basis.
This could be very useful to lower memory consumption when the statistics of huge binary columns are unnecessary.
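To make the request concrete, here is a purely hypothetical sketch of what such a toggle could look like, modeled on the existing per-column dictionary setting. withStatisticsEnabled does not exist in parquet-mr and only illustrates the proposal; the per-column withDictionaryEncoding overload also depends on the parquet-mr version:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class PerColumnStatsSketch {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc { required binary payload; required int64 id; }");

    ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("/tmp/out.parquet"))
        .withType(schema)
        .withDictionaryEncoding("payload", false) // existing per-column toggle
        .withStatisticsEnabled("payload", false)  // PROPOSED: skip min/max for this column
        .build();
    writer.close();
  }
}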
--
This message was sent by Atlassian Jira
(v8.3.4#803005)