Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2021/02/01 08:49:00 UTC

[jira] [Commented] (PARQUET-1805) Refactor the configuration for bloom filters

    [ https://issues.apache.org/jira/browse/PARQUET-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276149#comment-17276149 ] 

Gabor Szadovszky commented on PARQUET-1805:
-------------------------------------------

[~yumwang], I think this performance issue is related not to this Jira but to the bloom filter feature as a whole (PARQUET-41). If you turn on bloom filter writing for all columns, it will impact write performance. (See the related configuration parameters at https://github.com/apache/parquet-mr/tree/master/parquet-hadoop for details.)

I am not an expert on this feature, and maybe we can improve the write performance, but generating bloom filters will have a performance impact. It is up to the user to decide whether this impact is worth the potential benefit at read time. That is why it is highly recommended to specify exactly which columns require bloom filters, and to set the other bloom filter parameters as well.
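
As an illustration of enabling bloom filters only for selected columns, here is a minimal sketch that builds the relevant Hadoop-style properties as a plain dictionary. The property names (`parquet.bloom.filter.enabled`, `parquet.bloom.filter.expected.ndv`, `parquet.bloom.filter.max.bytes`) and the `#column.path` suffix are given as described in the parquet-hadoop README and should be checked against the README for the parquet-mr version in use. The helper `bloom_filter_properties` and the column names are hypothetical, chosen only for this example.

```python
# Sketch only: builds Hadoop-style property names for per-column bloom filters.
# Property names are assumed from the parquet-hadoop README; verify them
# against the README for the parquet-mr version you are running.


def bloom_filter_properties(columns, max_bytes=None):
    """Return a dict of properties that enable bloom filters only for `columns`.

    `columns` maps a column path (e.g. "session.id") to its expected number
    of distinct values, or to None to leave that parameter at its default.
    """
    # Keep bloom filters off globally so only the listed columns pay the cost.
    props = {"parquet.bloom.filter.enabled": "false"}
    for path, expected_ndv in columns.items():
        # A "#<column path>" suffix scopes the property to a single column.
        props[f"parquet.bloom.filter.enabled#{path}"] = "true"
        if expected_ndv is not None:
            props[f"parquet.bloom.filter.expected.ndv#{path}"] = str(expected_ndv)
    if max_bytes is not None:
        props["parquet.bloom.filter.max.bytes"] = str(max_bytes)
    return props


if __name__ == "__main__":
    # Hypothetical column names, used only for illustration.
    example = bloom_filter_properties({"user_id": 1_000_000, "session.id": None})
    for key, value in example.items():
        print(f"{key}={value}")
```

The resulting key/value pairs would then be set on the Hadoop configuration used by the writer. Restricting the filters to columns that are actually used in equality predicates keeps the write-time overhead limited to where a read-time benefit is plausible.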

[~junjie], any comments on this?

> Refactor the configuration for bloom filters
> --------------------------------------------
>
>                 Key: PARQUET-1805
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1805
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> Refactor the hadoop configuration for bloom filters according to PARQUET-1784.


