You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by Bcpoole <gi...@git.apache.org> on 2017/02/09 02:14:41 UTC

[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

GitHub user Bcpoole opened a pull request:

    https://github.com/apache/spark/pull/16864

    [SPARK-19527][Core] Approximate Size of Intersection of Bloom Filters

    **What changes were proposed in this pull request?**
    
    Added functions to get the Swamidass & Baldi (2007) approximation for number of items in a Bloom filter and the intersections of two filters. Added an exception type IncompatibleUnionException mimicing IncompatibleMergeException. As needed for the intersection approximation, there is a function that create the union of two Bloom filters (no mutations).
    
    **How was this patch tested?**
    
    Manual Tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Bcpoole/spark approxItemsInBloomFilterIntersection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16864
    
----
commit 7a3ad46ff86bd3d2d47f6a56bace1a0c4fd171c8
Author: Bcpoole <br...@gmail.com>
Date:   2017-02-09T01:11:07Z

    Swamidass & Baldi approx. items in intersection of two Bloom filters. Also function to create union (non-mutation) of two Bloom filters.

commit b9680c57b2f8b1d93c28884de9a7ebbe52505f6c
Author: Bcpoole <br...@gmail.com>
Date:   2017-02-09T01:42:36Z

    Changed createUnionBloomFilter & approxItemsInIntersection to be instance instead of static functions

commit 501ad7e22101b00862c0c77ef8c38e1b166d33a4
Author: Bcpoole <br...@gmail.com>
Date:   2017-02-09T01:53:50Z

    Updated abstract class to reflect changes in previous commit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    I meant just union, but createUnion ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100571673
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java ---
    @@ -221,6 +221,49 @@ public BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeExcep
       }
     
       @Override
    +  public double approxItems() {
    +    double m = bitSize();
    +    return (-m / numHashFunctions) * Math.log(1 - (bits.cardinality() / m));
    --- End diff --
    
    * `Math` is deprecated. Use `math`.
    * Please add a test when `bits` is full. This should return `Double.Infinity`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    ping @mengxr Should we move forward with this PR? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100570007
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -81,6 +81,11 @@ int getVersionNumber() {
       public abstract long bitSize();
     
       /**
    +   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
    +   */
    +  public abstract double approxItems();
    --- End diff --
    
    It might be easier to keep it as double because the estimate could be out of bound if the bits are full.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100569750
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -81,6 +81,11 @@ int getVersionNumber() {
       public abstract long bitSize();
     
       /**
    +   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
    --- End diff --
    
    Please describe the method first and its properties (approximation error).  Then put the reference in `@seealso` with a permanent link to the paper: https://dx.doi.org/10.1021%2Fci600526a


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100503141
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -81,6 +81,11 @@ int getVersionNumber() {
       public abstract long bitSize();
     
       /**
    +   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
    +   */
    +  public abstract double approxItems();
    --- End diff --
    
    yea but that would only be off by 1. I wouldn't worry about that since it is approximate anyway.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100261088
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/IncompatibleUnionException.java ---
    @@ -0,0 +1,24 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.sketch;
    +
    +public class IncompatibleUnionException extends Exception {
    --- End diff --
    
    we need some javadoc ere.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    cc @mengxr / @tjhunter / @jkbradley  is this good to have?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by Bcpoole <gi...@git.apache.org>.
Github user Bcpoole commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100385258
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -81,6 +81,11 @@ int getVersionNumber() {
       public abstract long bitSize();
     
       /**
    +   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
    +   */
    +  public abstract double approxItems();
    --- End diff --
    
    I was debating this due to possible rounding errors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by Bcpoole <gi...@git.apache.org>.
Github user Bcpoole commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    Looks good


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by Bcpoole <gi...@git.apache.org>.
Github user Bcpoole commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100381372
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/IncompatibleUnionException.java ---
    @@ -0,0 +1,24 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.sketch;
    +
    +public class IncompatibleUnionException extends Exception {
    --- End diff --
    
    In that case IncompatibleMergeException needs it too :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by Bcpoole <gi...@git.apache.org>.
Github user Bcpoole commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100624882
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java ---
    @@ -221,6 +221,49 @@ public BloomFilter mergeInPlace(BloomFilter other) throws IncompatibleMergeExcep
       }
     
       @Override
    +  public double approxItems() {
    +    double m = bitSize();
    +    return (-m / numHashFunctions) * Math.log(1 - (bits.cardinality() / m));
    --- End diff --
    
    > Math is deprecated. Use math.
    
    Assume you were thinking of Scala?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16864


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100261151
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -148,6 +153,20 @@ int getVersionNumber() {
       public abstract boolean mightContainBinary(byte[] item);
     
       /**
    +   * Returns a new Bloom filter of the union of two Bloom filters.
    +   * Unlike mergeInplace, this will not cause a mutation.
    +   * Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
    +   *
    +   * @throws IncompatibleUnionException if either are null, different classes, or different size or number of hash functions
    +   */
    +  public abstract BloomFilterImpl createUnionBloomFilter(BloomFilter other) throws IncompatibleUnionException;
    --- End diff --
    
    how about just calling this union?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    @jiangxb1987 yes I agree to close it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100571446
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -148,6 +153,24 @@ int getVersionNumber() {
       public abstract boolean mightContainBinary(byte[] item);
     
       /**
    +   * Returns a new Bloom filter of the union of two Bloom filters.
    +   * Unlike mergeInplace, this will not cause a mutation.
    +   * Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
    +   *
    +   * @param other The bloom filter to union this bloom filter with.
    +   * @throws IncompatibleUnionException if {@code isCompatible(other) == false}
    +   */
    +  public abstract BloomFilterImpl union(BloomFilter other) throws IncompatibleUnionException;
    +
    +  /**
    +   * Swamidass & Baldi (2007) approximation for number of items in the intersection of two Bloom filters
    --- End diff --
    
    * Same here. Document the method first and then mention the reference.
    * How is it different from intersecting two bloom filters and then estimate the number of items? Union might lead to larger approximation error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    Should we close this PR since it goes stale? WDYT @WeichenXu123 ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16864: [SPARK-19527][Core] Approximate Size of Intersection of ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/16864
  
    @Bcpoole Thanks for this PR. But I want to ask which place in spark can this extension apply to ? e.g. can this algo used in join cost estimating or somewhere else ? But if there is no apparent uses for now, I will decrease priority of reviewing this because there're many PRs accumulated waiting review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16864: [SPARK-19527][Core] Approximate Size of Intersect...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16864#discussion_r100261227
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilter.java ---
    @@ -81,6 +81,11 @@ int getVersionNumber() {
       public abstract long bitSize();
     
       /**
    +   * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
    +   */
    +  public abstract double approxItems();
    --- End diff --
    
    shouldn't this return a long rather than a double?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org