Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/18 19:23:58 UTC

[jira] [Closed] (SPARK-18252) Improve serialized BloomFilter size

     [ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-18252.
-------------------------------
    Resolution: Won't Fix

> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0, Spark has included a BloomFilter implementation, org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current implementation uses a custom class, org.apache.spark.util.sketch.BitArray, as its bit vector, which allocates memory for the whole filter no matter how many elements have been set. Since a BloomFilter can be serialized and sent over the network during the distribution stage, maybe we should use some kind of compressed Bloom filter, for example [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or [javaewah|https://github.com/lemire/javaewah].
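
The size concern above can be illustrated with a small sketch. This is not Spark's actual code; it is a hypothetical Python simulation of a dense BitArray sized with the standard Bloom-filter formula m = -n*ln(p)/(ln 2)^2, showing that when only a small fraction of the capacity is used, the dense byte array still takes its full size while a general-purpose compressor (standing in for a compressed bitmap such as RoaringBitmap) shrinks it dramatically:

```python
import math
import random
import zlib

# Size a filter for 1,000,000 expected items at 3% false-positive probability.
n_capacity = 1_000_000
fpp = 0.03
m_bits = int(-n_capacity * math.log(fpp) / (math.log(2) ** 2))

# A dense bit vector allocates all m bits up front, regardless of usage.
dense = bytearray(m_bits // 8 + 1)

# Simulate inserting only 1,000 items with ~3 hash functions each,
# i.e. roughly 3,000 set bits scattered across the whole vector.
random.seed(42)
for _ in range(3_000):
    pos = random.randrange(m_bits)
    dense[pos // 8] |= 1 << (pos % 8)

# Compress the sparse vector; most of it is runs of zero bytes.
compressed = zlib.compress(bytes(dense), level=6)
print(f"dense serialized size:      {len(dense):>9,} bytes")
print(f"compressed serialized size: {len(compressed):>9,} bytes")
```

The dense form is around 900 KB no matter how few bits are set, while the compressed form is an order of magnitude smaller for a lightly loaded filter. The trade-off (and likely the reason for the Won't Fix resolution) is that a compressed representation must be decompressed, or traversed through a more complex encoding, before membership queries can index individual bits.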



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org