
Posted to dev@parquet.apache.org by "Anthony Pessy (Jira)" <ji...@apache.org> on 2023/01/02 06:55:00 UTC

[jira] [Commented] (PARQUET-1911) Add way to disables statistics on a per column basis

    [ https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653512#comment-17653512 ] 

Anthony Pessy commented on PARQUET-1911:
----------------------------------------

[~shangx@uber.com] Sorry for the late reply; it seems I did not receive a notification.

 

Truncating min/max is not sufficient: I hit an OutOfMemory error well before I get anywhere near writing the footer. The truncation only takes place when the footer is actually written (if I recall correctly), and the full values are still kept in memory until then.
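
To make the timing problem concrete, here is a minimal sketch. It is not parquet-mr's actual code: MinMaxSketch and StatsMemorySketch are hypothetical names, and the model assumes one statistics object per row group that holds references to the full min and max values until the footer is written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the per-row-group statistics of one BINARY column.
final class MinMaxSketch {
    private byte[] min;
    private byte[] max;

    // Called for every value written: keeps full-length references, no truncation here.
    void update(byte[] value) {
        if (min == null || Arrays.compareUnsigned(value, min) < 0) min = value;
        if (max == null || Arrays.compareUnsigned(value, max) > 0) max = value;
    }

    // Payload bytes kept alive for as long as this object is reachable.
    long retainedBytes() {
        if (min == null) return 0;
        return min == max ? min.length : (long) min.length + max.length;
    }

    // Truncation as it would happen at footer time: a shortened copy is returned,
    // but the full-length min is still referenced by this object.
    byte[] truncatedMin(int length) {
        return min == null ? null : Arrays.copyOf(min, Math.min(length, min.length));
    }
}

final class StatsMemorySketch {
    // Simulates closing `rowGroups` row groups, each holding two large values,
    // and reports how many payload bytes are retained before any footer is written.
    static long retainedBeforeFooter(int rowGroups, int valueSize) {
        List<MinMaxSketch> pending = new ArrayList<>();
        for (int i = 0; i < rowGroups; i++) {
            MinMaxSketch stats = new MinMaxSketch();
            byte[] low = new byte[valueSize];   // all zeros
            byte[] high = new byte[valueSize];
            Arrays.fill(high, (byte) 1);        // compares greater than `low`
            stats.update(low);
            stats.update(high);
            pending.add(stats);
        }
        long total = 0;
        for (MinMaxSketch s : pending) total += s.retainedBytes();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(retainedBeforeFooter(4, 1 << 20) + " bytes retained before the footer");
    }
}
```

In this model the retained size grows with the number of row groups times the value size, and a truncate length only affects the copy produced at the very end.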

 

(The whole idea of computing min/max over an HTML payload or similar fields does not make much sense anyway.)

 

I'll try to find the time to submit a PR, as I'd like to stop relying on a forked version.

> Add way to disables statistics on a per column basis
> ----------------------------------------------------
>
>                 Key: PARQUET-1911
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1911
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Anthony Pessy
>            Priority: Major
>         Attachments: NoOpStatistics.java, add_config_to_opt-out_of_a_column's_statistics.patch
>
>
> When you write a dataset with BINARY columns that can be fairly large (several MB), you can often end up with an OutOfMemory error, where you have to either:
>  
>  - Throw more RAM at it
>  - Increase the number of output files
>  - Tune the block size
>  
> Using a fork that checks the row group size more frequently helps, but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])
>  
>  
> The OutOfMemory error is now caused by the accumulation of min/max values for those columns in each BlockMetaData.
>  
> The "parquet.statistics.truncate.length" configuration is of no help because it is applied during the footer serialization whereas the OOM occurs before that.
>  
> I think it would be nice to have, as for dictionaries or bloom filters, a way to disable statistics on a per-column basis.
>  
> This could be very useful to lower memory consumption when statistics for huge binary columns are unnecessary.
>  
>  
>  
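
As a rough illustration of the per-column opt-out requested above, the sketch below resolves a statistics flag per column from a plain key/value configuration. The key name "parquet.statistics.enabled" and the class ColumnStatsConfigSketch are assumptions made for this example, not an existing option or class. The "key#column.path" form is modelled on how, as far as I know, the per-column dictionary and bloom filter settings are expressed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column statistics switch; names are illustrative only.
final class ColumnStatsConfigSketch {
    // Assumed key name for this sketch.
    static final String STATS_ENABLED = "parquet.statistics.enabled";

    private final boolean defaultEnabled;
    private final Map<String, Boolean> perColumn = new HashMap<>();

    ColumnStatsConfigSketch(Map<String, String> conf) {
        // Global default: statistics stay enabled unless configured otherwise.
        this.defaultEnabled = Boolean.parseBoolean(conf.getOrDefault(STATS_ENABLED, "true"));
        // Per-column overrides use the form "<key>#<column.path>".
        String prefix = STATS_ENABLED + "#";
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                String columnPath = e.getKey().substring(prefix.length());
                perColumn.put(columnPath, Boolean.parseBoolean(e.getValue()));
            }
        }
    }

    // True if min/max should be collected for the given column path.
    boolean statisticsEnabled(String columnPath) {
        return perColumn.getOrDefault(columnPath, defaultEnabled);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(STATS_ENABLED + "#html_payload", "false"); // hypothetical column name
        ColumnStatsConfigSketch cfg = new ColumnStatsConfigSketch(conf);
        System.out.println("html_payload: " + cfg.statisticsEnabled("html_payload"));
        System.out.println("id: " + cfg.statisticsEnabled("id"));
    }
}
```

A writer could consult such a flag when creating the statistics object for a column and substitute a no-op implementation when it is false, so that large values are never retained.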


