Posted to dev@parquet.apache.org by 俊杰陈 <cj...@gmail.com> on 2018/06/26 11:43:54 UTC

How to set default maximum size of bloom filter?

Hi devs

I'm implementing the bloom filter feature and need to set a default maximum
size for the bloom filter of a block. According to the calculation here
<https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=0.>,
I plan to set the maximum size to 1/8 of parquet.block.size, which achieves
an FPP (false positive probability) of about 0.25 when a block contains only
one column of long type and all values are distinct. What do you think?
Any feedback is welcome.
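
For readers who want to experiment with the sizing arithmetic, here is a
minimal sketch. It uses the textbook approximation for a classic Bloom
filter, p = (1 - e^(-k*n/m))^k, and is not the calculation from the linked
spreadsheet. The function names, the 128 MiB block size, the hash count of 8,
and the assumption that every long value occupies 8 uncompressed bytes are
all illustrative choices, so the result is not expected to match the figure
quoted above.

```python
import math


def bloom_fpp(num_bits: int, num_values: int, num_hashes: int) -> float:
    """Textbook false-positive estimate for a classic Bloom filter.

    p = (1 - exp(-k * n / m)) ** k, with m bits, n distinct values, k hashes.
    """
    if num_bits <= 0 or num_hashes <= 0 or num_values < 0:
        raise ValueError("num_bits and num_hashes must be > 0, num_values >= 0")
    return (1.0 - math.exp(-num_hashes * num_values / num_bits)) ** num_hashes


def worst_case_fpp(
    block_size_bytes: int,
    size_fraction: float = 1 / 8,   # filter capped at this share of the block
    value_width_bytes: int = 8,     # assumption: plain, uncompressed longs
    num_hashes: int = 8,            # assumption: illustrative hash count
) -> float:
    """Estimate the FPP when one long column fills the block with distinct values."""
    num_values = block_size_bytes // value_width_bytes
    num_bits = int(block_size_bytes * size_fraction) * 8
    return bloom_fpp(num_bits, num_values, num_hashes)


if __name__ == "__main__":
    block = 128 * 1024 * 1024  # illustrative block size in bytes
    for fraction in (1 / 32, 1 / 16, 1 / 8):
        print(f"fraction={fraction:.5f}  fpp={worst_case_fpp(block, fraction):.6f}")
```

Changing value_width_bytes (for example, to model encoding or compression)
or num_hashes shifts the estimate considerably, which is why the cap only
makes sense together with the assumptions behind it.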

-- 
Thanks & Best Regards

Re: How to set default maximum size of bloom filter?

Posted by 俊杰陈 <cj...@gmail.com>.
Thanks, I agree to discuss this when we work on the write-side task.

Ryan Blue <rb...@netflix.com> wrote on Wed, Jun 27, 2018 at 7:25 AM:

> Thanks for the additional context, but I don't quite get why a utility
> class like this would need to make a call on what the maximum size of a
> bloom filter should be in the format. That's really a write-side concern.
> Can we just remove that code from the current PR and discuss it when we are
> working on how to produce appropriately-configured bloom filters?


-- 
Thanks & Best Regards

Re: How to set default maximum size of bloom filter?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for the additional context, but I don't quite get why a utility
class like this would need to make a call on what the maximum size of a
bloom filter should be in the format. That's really a write-side concern.
Can we just remove that code from the current PR and discuss it when we are
working on how to produce appropriately-configured bloom filters?

On Tue, Jun 26, 2018 at 4:09 PM 俊杰陈 <cj...@gmail.com> wrote:

> Hi Ryan,
>
> The last comment on the doc asked for a benchmark of dictionary vs. Bloom
> filter. I provided the benchmark results here
> <https://docs.google.com/spreadsheets/d/1yV3u-P_yY4DtfSty3LPrbhwuJx4cqm_YeK61s2v0OLU/edit?usp=sharing>.
> Jim has reviewed them and also updated the comments on JIRA. You can check
> JIRA <https://issues.apache.org/jira/browse/PARQUET-41> for the latest
> status.
>
> We created some sub-tasks for PARQUET-41, and the first step [PARQUET-1332
> <https://issues.apache.org/jira/browse/PARQUET-1332>] is to implement the
> Bloom filter utility class itself in parquet-mr and parquet-cpp. The
> question above is related to it.


-- 
Ryan Blue
Software Engineer
Netflix

Re: How to set default maximum size of bloom filter?

Posted by 俊杰陈 <cj...@gmail.com>.
Hi Ryan,

The last comment on the doc asked for a benchmark of dictionary vs. Bloom
filter. I provided the benchmark results here
<https://docs.google.com/spreadsheets/d/1yV3u-P_yY4DtfSty3LPrbhwuJx4cqm_YeK61s2v0OLU/edit?usp=sharing>.
Jim has reviewed them and also updated the comments on JIRA. You can check
JIRA <https://issues.apache.org/jira/browse/PARQUET-41> for the latest
status.

We created some sub-tasks for PARQUET-41, and the first step [PARQUET-1332
<https://issues.apache.org/jira/browse/PARQUET-1332>] is to implement the
Bloom filter utility class itself in parquet-mr and parquet-cpp. The question
above is related to it.



Ryan Blue <rb...@netflix.com.invalid> wrote on Wed, Jun 27, 2018 at 12:35 AM:

> I thought the plan was to finish the bloom filter spec and then decide how
> to create appropriately sized filters. This sounds like a write-side
> implementation detail to me. What is the current plan for getting this work
> in?


-- 
Thanks & Best Regards

Re: How to set default maximum size of bloom filter?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I thought the plan was to finish the bloom filter spec and then decide how
to create appropriately sized filters. This sounds like a write-side
implementation detail to me. What is the current plan for getting this work
in?

On Mon, Jun 25, 2018 at 8:43 PM 俊杰陈 <cj...@gmail.com> wrote:

> Hi devs
>
> I'm implementing the bloom filter feature and need to set a default maximum
> size for the bloom filter of a block. According to the calculation here
> <https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=0.>,
> I plan to set the maximum size to 1/8 of parquet.block.size, which achieves
> an FPP (false positive probability) of about 0.25 when a block contains only
> one column of long type and all values are distinct. What do you think?
> Any feedback is welcome.
>
> --
> Thanks & Best Regards
>


-- 
Ryan Blue
Software Engineer
Netflix