Posted to dev@parquet.apache.org by "Abhishek Jain (Jira)" <ji...@apache.org> on 2022/12/30 07:12:00 UTC

[jira] [Commented] (PARQUET-2220) Parquet Filter predicate storing nested string causing OOM's

    [ https://issues.apache.org/jira/browse/PARQUET-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653038#comment-17653038 ] 

Abhishek Jain commented on PARQUET-2220:
----------------------------------------

Could any of the Parquet contributors take a look at this?

> Parquet Filter predicate storing nested string causing OOM's
> ------------------------------------------------------------
>
>                 Key: PARQUET-2220
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2220
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Abhishek Jain
>            Priority: Critical
>
> Each instance of ColumnFilterPredicate eagerly computes its string representation, including the filter value, and stores it in the toString field, which is not useful.
> {code:java}
> static abstract class ColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
>   private final Column<T> column;
>   private final T value;
>   private final String toString;
>
>   protected ColumnFilterPredicate(Column<T> column, T value) {
>     this.column = Objects.requireNonNull(column, "column cannot be null");
>     // Eq and NotEq allow value to be null, Lt, Gt, LtEq, GtEq however do not, so they guard against
>     // null in their own constructors.
>     this.value = value;
>     String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
>     this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + value + ")";
>   }
> {code}
>  
>  
> If the filter predicate is long or deeply nested, this string can consume a lot of memory when the filter is created.
> In our production environment we have seen it take up to 4 GB of memory when multiple Parquet readers are open.
> The same pattern is repeated in BinaryLogicalFilterPredicate, where toString is also computed eagerly and stored as a string, so a lot of duplication occurs when building And/Or filters.
> I could not find a use case that requires computing it so eagerly.
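
For illustration, one possible direction is to drop the stored string and build it only when toString() is called. The sketch below is not the Parquet code: LazyToStringSketch, Predicate, Eq and And are hypothetical stand-ins for ColumnFilterPredicate and BinaryLogicalFilterPredicate, and the column is reduced to a plain path string. It only shows that nested predicates can keep references to their children instead of a copy of the children's text.

```java
import java.util.Objects;

/** Hypothetical stand-ins for illustration only; these are not the Parquet classes. */
final class LazyToStringSketch {

    /** Minimal marker interface, assumed for this sketch. */
    interface Predicate {}

    /** Column predicate that keeps only the column path and the value. */
    static final class Eq<T extends Comparable<T>> implements Predicate {
        private final String columnPath;
        private final T value; // may be null, as for Eq/NotEq in the quoted code

        Eq(String columnPath, T value) {
            this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
            this.value = value;
        }

        @Override
        public String toString() {
            // Built on demand; no string is stored on the instance.
            return "eq(" + columnPath + ", " + value + ")";
        }
    }

    /** Logical predicate that references its children instead of copying their text. */
    static final class And implements Predicate {
        private final Predicate left;
        private final Predicate right;

        And(Predicate left, Predicate right) {
            this.left = Objects.requireNonNull(left, "left cannot be null");
            this.right = Objects.requireNonNull(right, "right cannot be null");
        }

        @Override
        public String toString() {
            // Children are rendered only when a caller actually asks for the string.
            return "and(" + left + ", " + right + ")";
        }
    }

    public static void main(String[] args) {
        // Build a small left-nested chain of And predicates.
        Predicate p = new Eq<>("a.b", "x");
        for (int i = 0; i < 3; i++) {
            p = new And(p, new Eq<>("c", Integer.valueOf(i)));
        }
        // The full text exists only for the duration of this call.
        System.out.println(p);
    }
}
```

The trade-off is that each toString() call rebuilds the text, so code that logs the same filter repeatedly pays that cost every time. If that mattered, a lazily initialized cache would be another option, though it would bring back part of the memory cost once the string is requested.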


