Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2023/03/07 08:09:00 UTC

[jira] [Assigned] (PARQUET-2254) Build a BloomFilter with a more precise size

     [ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky reassigned PARQUET-2254:
-----------------------------------------

    Assignee: Mars

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently the user has to specify the size up front and then build the BloomFilter. In many scenarios, however, the number of distinct values is not known in advance.
> If the BloomFilter could be sized automatically from the data, the file size could be reduced and read efficiency improved as well.
> One idea: the user specifies only a maximum BloomFilter size, and we build several candidate BloomFilters of different sizes at the same time. The largest BloomFilter can double as a counting tool: if a value does not hit when it is inserted, the distinct counter is incremented by 1 (this is imprecise, but precise enough for sizing). A rough sketch follows below the issue description.
> Then, at the end of the write, we choose the BloomFilter of the most appropriate size when the file is finally written.
> I want to implement this feature and would appreciate your opinions. Thank you.
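
A minimal sketch of the idea described above, using a simplified stand-alone Bloom filter rather than Parquet's actual classes (names such as SimpleBloom, AdaptiveBloomSketch, insert, and chooseForWrite are hypothetical, chosen only for illustration): candidates of increasing size are filled in parallel, the largest one doubles as an approximate distinct-value counter, and at flush time the smallest candidate whose capacity covers the estimated NDV is the one written out.

    import java.util.BitSet;

    // Illustrative sketch only; not the Parquet API.
    public class AdaptiveBloomSketch {

        // Tiny Bloom filter over pre-hashed 64-bit values.
        static final class SimpleBloom {
            final BitSet bits;
            final int numBits;
            final int numHashes;

            SimpleBloom(int numBits, int numHashes) {
                this.numBits = numBits;
                this.numHashes = numHashes;
                this.bits = new BitSet(numBits);
            }

            // Insert a pre-hashed value; returns true if it was (probably) already present.
            boolean insert(long hash) {
                boolean alreadyPresent = true;
                for (int i = 0; i < numHashes; i++) {
                    // Derive k probe positions from one 64-bit hash (double hashing).
                    int pos = Math.floorMod((int) (hash + i * (hash >>> 32)), numBits);
                    if (!bits.get(pos)) {
                        alreadyPresent = false;
                        bits.set(pos);
                    }
                }
                return alreadyPresent;
            }
        }

        private final SimpleBloom[] candidates; // ascending sizes; last one is the largest
        private final int[] ndvCapacity;        // rough NDV each candidate is sized for
        private long estimatedNdv = 0;          // counted on the largest candidate only

        AdaptiveBloomSketch(int[] candidateBits, int[] ndvCapacity, int numHashes) {
            this.candidates = new SimpleBloom[candidateBits.length];
            for (int i = 0; i < candidateBits.length; i++) {
                candidates[i] = new SimpleBloom(candidateBits[i], numHashes);
            }
            this.ndvCapacity = ndvCapacity;
        }

        void insert(long hash) {
            for (int i = 0; i < candidates.length; i++) {
                boolean hit = candidates[i].insert(hash);
                // A miss on the largest candidate means a value not seen before:
                // imprecise, but good enough to estimate the distinct count.
                if (i == candidates.length - 1 && !hit) {
                    estimatedNdv++;
                }
            }
        }

        // At write time, keep the smallest candidate that covers the estimated NDV.
        SimpleBloom chooseForWrite() {
            for (int i = 0; i < candidates.length; i++) {
                if (estimatedNdv <= ndvCapacity[i]) {
                    return candidates[i];
                }
            }
            return candidates[candidates.length - 1];
        }
    }

The memory cost is the sum of all candidate sizes during the write, which is bounded by roughly twice the user-specified maximum when candidate sizes grow geometrically; only the selected candidate is serialized into the file.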



--
This message was sent by Atlassian Jira
(v8.20.10#820010)