
Posted to dev@apex.apache.org by "Bhupesh Chawda (JIRA)" <ji...@apache.org> on 2017/02/24 06:15:44 UTC

[jira] [Commented] (APEXMALHAR-2366) Apply BloomFilter to Bucket

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882040#comment-15882040 ] 

Bhupesh Chawda commented on APEXMALHAR-2366:
--------------------------------------------

Hi [~brightchen]
Sorry for commenting so late. 
[~chaithu] has already implemented a Bloom filter in the Megh library. You can see it here: https://github.com/DataTorrent/Megh/blob/master/library/src/main/java/com/datatorrent/lib/bucket/bloomFilter/BloomFilter.java

Can you please check whether this implementation can be reused? I ask because the one in Megh was well tested as part of an earlier Deduper implementation.

> Apply BloomFilter to Bucket
> ---------------------------
>
>                 Key: APEXMALHAR-2366
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2366
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>            Reporter: bright chen
>            Assignee: bright chen
>   Original Estimate: 192h
>  Remaining Estimate: 192h
>
> The bucket's get() checks the cache first and, if the entry is not in the cache, checks the stored files. Checking the files is a pretty heavy operation because of the file seeks.
> The chance of having to check the files is very high if the key range is large.
> I suggest applying a BloomFilter to the bucket to reduce the chance of reading from files.
> If the buckets are managed by ManagedStateImpl, the number of entries in a bucket can become very large, and the BloomFilter may no longer be useful after a while. But if the buckets are managed by ManagedTimeUnifiedStateImpl, each bucket keeps a certain number of entries, and the BloomFilter would be very useful.
> For implementation:
> Guava already has a BloomFilter, and its interface is pretty simple and fits our case. But Guava 11 is not compatible with Guava 14 (Guava 11 uses Sink while Guava 14 uses PrimitiveSink).
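
To make the proposal in the description concrete, here is a minimal sketch of a bucket whose get() consults a Bloom filter before falling back to the files. It is not the Malhar, Megh, or Guava code: the names BloomFilteredBucket, SimpleBloomFilter, flushToFiles and fileReads are hypothetical, the "files" are simulated with an in-memory map, and the filter size (65536 bits, 3 hash positions) is an arbitrary assumption. It uses only the JDK, so it sidesteps the Guava version question.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual ManagedState bucket.
final class BloomFilteredBucket {

    /** Tiny Bloom filter: each key sets numHashes bit positions derived by double hashing. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // i-th bit position for a key: (h1 + i * h2) mod numBits, always non-negative.
        private int position(long key, int i) {
            int h1 = Long.hashCode(key);
            int h2 = Long.hashCode(key * 0x9E3779B97F4A7C15L + 1);
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void put(long key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(position(key, i));
            }
        }

        /** false means "definitely absent"; true means "possibly present". */
        boolean mightContain(long key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(position(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final Map<Long, String> cache = new HashMap<>();
    // Stands in for the on-disk bucket files (assumption for this sketch).
    private final Map<Long, String> files = new HashMap<>();
    private final SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);

    /** Counts how often get() had to fall through to the (simulated) files. */
    int fileReads = 0;

    void put(long key, String value) {
        cache.put(key, value);
        filter.put(key); // every key ever written is recorded in the filter
    }

    /** Simulates spilling the cache to files and evicting it from memory. */
    void flushToFiles() {
        files.putAll(cache);
        cache.clear();
    }

    String get(long key) {
        String value = cache.get(key);
        if (value != null) {
            return value;          // cache hit, no file access
        }
        if (!filter.mightContain(key)) {
            return null;           // filter rules the key out, skip the file seek
        }
        fileReads++;               // possible hit (or false positive): read the files
        return files.get(key);
    }
}
```

In this sketch a lookup for a key that was never written normally returns null without touching the files; only a false positive of the filter costs a file read. Because the filter never forgets keys, its false-positive rate grows with the number of entries, which matches the concern above about unbounded buckets under ManagedStateImpl.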



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)