Posted to dev@parquet.apache.org by "yabola (via GitHub)" <gi...@apache.org> on 2023/03/07 07:15:03 UTC

[GitHub] [parquet-mr] yabola commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457669848

   @wgtmac @gszadovszky 
   I have a proposal to automatically build a BloomFilter with a more precise size. I created a Jira, https://issues.apache.org/jira/browse/PARQUET-2254, and would appreciate your opinions. Thank you.
   
   > Currently, the user specifies the size up front and the BloomFilter is then built. In general, the number of distinct values is not known in advance.
   If the BloomFilter could be sized automatically from the data, the file size could be reduced and read efficiency could also improve.
   
   My idea is that the user specifies a maximum BloomFilter size, and we then build several BloomFilters of different sizes at the same time. The largest BloomFilter doubles as a counting tool: if inserting a value produces no hit, a distinct-value counter is incremented (this may be imprecise, but it is accurate enough).
   At the end of the write, we choose the BloomFilter with the most appropriate size and write that one to the file.
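
A rough sketch of how this could look, in Java. Everything in it is hypothetical and only illustrates the idea: `AdaptiveBloomSketch`, `SimpleBloom`, the fixed probe count `K`, the halving candidate sizes, and the target false-positive rate are all assumptions, not existing parquet-mr classes or settings. It also uses a plain bit-array Bloom filter rather than Parquet's split-block layout, to keep the example short.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative sketch only; none of these names exist in parquet-mr. */
final class AdaptiveBloomSketch {

    /** Plain Bloom filter over 64-bit hashes (not Parquet's split-block layout). */
    static final class SimpleBloom {
        static final int K = 7; // probes per value, fixed for simplicity (assumption)
        final int numBits;
        final BitSet bits;

        SimpleBloom(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        private int probe(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1; // odd step for double hashing
            return Math.floorMod(h1 + i * h2, numBits);
        }

        /** Sets the probe bits; returns true if all were already set (a "hit"). */
        boolean insert(long hash) {
            boolean hit = true;
            for (int i = 0; i < K; i++) {
                int idx = probe(hash, i);
                if (!bits.get(idx)) {
                    hit = false;
                    bits.set(idx);
                }
            }
            return hit;
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < K; i++) {
                if (!bits.get(probe(hash, i))) return false;
            }
            return true;
        }

        /** Expected false-positive rate after n distinct insertions. */
        double expectedFpp(long n) {
            return Math.pow(1.0 - Math.exp(-(double) K * n / numBits), K);
        }
    }

    private final List<SimpleBloom> candidates = new ArrayList<>(); // largest first
    private final double targetFpp;
    private long distinctEstimate = 0;

    AdaptiveBloomSketch(int maxBits, int minBits, double targetFpp) {
        if (minBits < 1 || maxBits < minBits) throw new IllegalArgumentException("bad sizes");
        // Candidate sizes: maxBits, maxBits/2, ... down to minBits.
        for (int m = maxBits; m >= minBits; m /= 2) candidates.add(new SimpleBloom(m));
        this.targetFpp = targetFpp;
    }

    /** splitmix64-style mixer, standing in for the real value hash. */
    static long mix(long value) {
        long z = value + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Inserts into every candidate; a miss in the largest one counts as a new distinct value. */
    void insert(long value) {
        long h = mix(value);
        boolean hitInLargest = candidates.get(0).insert(h);
        for (int i = 1; i < candidates.size(); i++) candidates.get(i).insert(h);
        if (!hitInLargest) distinctEstimate++;
    }

    long distinctEstimate() { return distinctEstimate; }

    /** Smallest candidate whose expected FPP meets the target; falls back to the largest. */
    SimpleBloom choose() {
        for (int i = candidates.size() - 1; i >= 0; i--) {
            SimpleBloom c = candidates.get(i);
            if (c.expectedFpp(distinctEstimate) <= targetFpp) return c;
        }
        return candidates.get(0);
    }
}
```

The counter can only undercount: a repeated value always hits in the largest filter, while a new value is occasionally missed because of a false positive. The cost of this approach is that memory and insert time during the write grow with the number of candidates, so the candidate list would probably need to stay short.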
   
   I want to implement this feature and

