Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2023/03/07 08:43:00 UTC

[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

    [ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697301#comment-17697301 ] 

Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------

I think this is a good idea. It would increase the memory footprint of the writer, but if you plan to keep the current logic where the user decides which columns bloom filters are generated for, that should be acceptable.
However, I think we need to take one step back and investigate/synchronize the efforts around row group filtering. Or maybe it is only me for whom the following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than reading the bloom filter? Bloom filters should usually be smaller than dictionaries and faster to scan for a value.
* Based on the previous point, if we decide that it might be worth reading the bloom filter before the dictionary, that also questions the logic of not writing bloom filters when the whole column chunk is dictionary encoded.
* On the other hand, if the whole column chunk is dictionary encoded but the dictionary is still small (the redundancy is high), then it might not be worth writing a bloom filter, since checking the dictionary might be cheaper.
What do you think?

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently, the user has to specify the size up front, and the BloomFilter is built with it. In general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter size could be derived automatically from the data, the file size could be reduced and reading efficiency could also be improved.
> I have an idea: the user specifies a maximum BloomFilter size, and we build multiple BloomFilters of different sizes at the same time. We can use the largest BloomFilter as a counting tool (if there is no hit when inserting a value, a distinct-value counter is incremented; this may be imprecise, but it is good enough).
> Then, when the file is finally written, we choose the BloomFilter of the most appropriate size.
> I want to implement this feature and hope to get your opinions. Thank you.
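The proposal above can be sketched in a few lines. This is a minimal, self-contained illustration, not the parquet-mr API; the class names, the simple double-hashing scheme, and the ~10 bits-per-value sizing rule are all assumptions for the sketch. The writer inserts each value's hash into every candidate filter, uses the largest filter's "no hit on insert" signal as an approximate distinct counter, and at the end keeps the smallest candidate sized for the observed NDV:

```java
import java.util.BitSet;

// Hypothetical sketch (not the parquet-mr BloomFilter API): maintain several
// candidate bloom filters of increasing size while writing, use the largest
// as an approximate distinct-value counter, and keep only the smallest
// candidate that can hold the observed NDV at the end.
class AdaptiveBloomWriter {
    static class Bloom {
        final BitSet bits;
        final int numBits;
        static final int NUM_HASHES = 4;
        // Golden-ratio constant used to derive NUM_HASHES indices from one hash.
        static final long STEP = 0x9E3779B97F4A7C15L;

        Bloom(int numBits) { this.numBits = numBits; this.bits = new BitSet(numBits); }

        // Sets the value's bits; returns true if every bit was already set
        // (i.e. the value was possibly present before this insert).
        boolean insert(long hash) {
            boolean present = true;
            for (int i = 0; i < NUM_HASHES; i++) {
                int idx = (int) Long.remainderUnsigned(hash + i * STEP, numBits);
                if (!bits.get(idx)) { present = false; bits.set(idx); }
            }
            return present;
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) {
                int idx = (int) Long.remainderUnsigned(hash + i * STEP, numBits);
                if (!bits.get(idx)) return false;
            }
            return true;
        }
    }

    final Bloom[] candidates; // ascending sizes; the largest doubles as the counter
    long approxNdv = 0;       // approximate distinct-value count (may undercount)

    AdaptiveBloomWriter(int... bitSizes) {
        candidates = new Bloom[bitSizes.length];
        for (int i = 0; i < bitSizes.length; i++) candidates[i] = new Bloom(bitSizes[i]);
    }

    void add(long hash) {
        for (int i = 0; i < candidates.length - 1; i++) candidates[i].insert(hash);
        // Largest filter: a miss on insert means a (probably) unseen value.
        if (!candidates[candidates.length - 1].insert(hash)) approxNdv++;
    }

    // Pick the smallest candidate sized for the observed NDV; ~10 bits per
    // distinct value gives roughly a 1% false-positive rate.
    Bloom finish() {
        long neededBits = approxNdv * 10;
        for (Bloom b : candidates) if (b.numBits >= neededBits) return b;
        return candidates[candidates.length - 1];
    }
}
```

The counter can only undercount (a false positive in the largest filter suppresses an increment), so the chosen filter may occasionally be one size too small; the maximum candidate size bounds the memory overhead, which matches the concern about writer footprint above.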



--
This message was sent by Atlassian Jira
(v8.20.10#820010)