Posted to issues@spark.apache.org by "Aleksey Ponkin (JIRA)" <ji...@apache.org> on 2016/11/09 11:33:58 UTC

[jira] [Issue Comment Deleted] (SPARK-18252) Improve serialized BloomFilter size

     [ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Ponkin updated SPARK-18252:
-----------------------------------
    Comment: was deleted

(was: [~cloud_fan] Can you please clarify a couple of things for me:
In [BloomFilterImpl.java|https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java#L93-L98] the combined hash is an int, but a BloomFilter can be created with expected insertions much larger than Integer.MAX_VALUE. As far as I can tell, for such filters every bit set in the BitArray falls in the range [0..Integer.MAX_VALUE], which means we cannot guarantee the expected false positive probability for filters with more than Integer.MAX_VALUE bits. Is that correct, or is it a bug? I've checked the Guava BloomFilter, and it uses different strategies to translate element instances into bit indexes: [BloomFilterStrategies|https://github.com/google/guava/blob/6ded67ff124f1be4f318dbbeae136d0c995faf37/guava/src/com/google/common/hash/BloomFilterStrategies.java]. If it is a bug, do we need a separate JIRA for it, or can I fix it along with this ticket?
Thanks in advance!)
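
The concern quoted above is easy to demonstrate in isolation. The following is a minimal, self-contained Java sketch, not Spark's actual code (the class and method names are made up for illustration): intIndex mirrors the int-based combinedHash logic linked in the comment, while longIndex shows a 64-bit alternative in the spirit of Guava's MURMUR128_MITZ_64 strategy.

    // Illustrative sketch only; assumed names, no Spark dependencies.
    import java.util.Random;

    public class BloomIndexSketch {

        // int-based scheme: after the sign flip, combinedHash lies in
        // [0, Integer.MAX_VALUE], so combinedHash % bitSize can never
        // reach a bit beyond 2^31 - 1, no matter how large bitSize is.
        static long intIndex(int h1, int h2, int i, long bitSize) {
            int combinedHash = h1 + (i * h2);
            if (combinedHash < 0) {
                combinedHash = ~combinedHash; // flip to a non-negative value
            }
            return combinedHash % bitSize;
        }

        // long-based scheme: a 64-bit combined hash can address the full range.
        static long longIndex(long h1, long h2, int i, long bitSize) {
            long combinedHash = h1 + (i * h2);
            combinedHash &= Long.MAX_VALUE;   // clear the sign bit
            return combinedHash % bitSize;
        }

        public static void main(String[] args) {
            long bitSize = 8L * Integer.MAX_VALUE; // a filter wider than 2^31 bits
            Random rnd = new Random(42);
            long maxInt = 0, maxLong = 0;
            for (int n = 0; n < 1_000_000; n++) {
                for (int i = 1; i <= 5; i++) {
                    maxInt = Math.max(maxInt, intIndex(rnd.nextInt(), rnd.nextInt(), i, bitSize));
                    maxLong = Math.max(maxLong, longIndex(rnd.nextLong(), rnd.nextLong(), i, bitSize));
                }
            }
            System.out.println("max bit index, int scheme:  " + maxInt);  // stays below 2^31
            System.out.println("max bit index, long scheme: " + maxLong); // approaches bitSize
        }
    }

Running this prints a maximum index below 2^31 for the int scheme even though the filter has roughly 2^34 bits, matching the observation in the deleted comment, while the long scheme reaches indexes close to bitSize.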

> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0, Spark has had a BloomFilter implementation, org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current implementation uses a custom class, org.apache.spark.util.sketch.BitArray, for the bit vector, which allocates memory for the whole filter no matter how many bits are actually set. Since a BloomFilter can be serialized and sent over the network during the distribution stage, maybe we should use some kind of compressed bloom filter, for example [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or [JavaEWAH|https://github.com/lemire/javaewah].
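
As a rough illustration of the proposal, here is a hedged sketch using plain Java and the RoaringBitmap library linked above (the filter dimensions are made up, and this is not how Spark's BitArray is actually serialized): it compares the dense long[]-style footprint of a mostly-empty filter with RoaringBitmap's compressed serialization of the same set bits.

    // Illustrative sketch only; assumes org.roaringbitmap:RoaringBitmap
    // is on the classpath. Filter sizes are made-up example values.
    import org.roaringbitmap.RoaringBitmap;
    import java.util.Random;

    public class CompressedFilterSketch {
        public static void main(String[] args) {
            int numBits = 1 << 26;    // a 64 Mbit filter: 8 MB as a dense long[]
            int numSetBits = 10_000;  // only a few insertions so far

            RoaringBitmap bits = new RoaringBitmap();
            Random rnd = new Random(0);
            for (int i = 0; i < numSetBits; i++) {
                bits.add(rnd.nextInt(numBits)); // set a pseudo-random bit
            }
            bits.runOptimize(); // let the library pick cheaper container types

            System.out.println("dense long[] footprint: " + (numBits / 8) + " bytes");
            System.out.println("compressed serialized:  " + bits.serializedSizeInBytes() + " bytes");
        }
    }

The trade-off is that set and membership operations on a compressed bitmap are slower than on a flat long[], so a compressed representation may pay off only for the serialized form sent over the network, not for the in-memory filter.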



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
