Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/08/26 20:51:44 UTC

[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2142

    [SPARK-2871] [PySpark] add countApproxDistinct() API

    RDD.countApproxDistinct(relativeSD=0.05):
    
            :: Experimental ::
            Return approximate number of distinct elements in the RDD.
    
            The algorithm used is based on streamlib's implementation of
            "HyperLogLog in Practice: Algorithmic Engineering of a State
            of The Art Cardinality Estimation Algorithm", available
            <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    
            This supports all types of objects that are supported by
            Pyrolite, which covers nearly all built-in types.
    
            @param relativeSD Relative accuracy. Smaller values create
                               counters that require more space.
                               It must be greater than 0.000017.
    
            >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
            >>> 950 < n < 1050
            True
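
    A hypothetical usage sketch (not part of the patch; it assumes a live
    SparkContext `sc`, as in the doctest above). A smaller relativeSD buys
    a tighter estimate at the cost of more space:

            >>> rdd = sc.parallelize(range(100000)).map(str)
            >>> n = rdd.countApproxDistinct(relativeSD=0.01)
            >>> 97000 < n < 103000
            True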

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark countApproxDistinct

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2142
    
----
commit e97e342dac1cbbad2e424a39159a6c7f3fa63bf4
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-26T18:49:34Z

    add countApproxDistinct()

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53795341
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19423/consoleFull) for PR 2142 at commit [`2ab157c`](https://github.com/apache/spark/commit/2ab157c567fe9f45394160377ac7f26c05ac2f0b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16915511
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,46 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    +        >>> 950 < n < 1050
    +        True
    +        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
    +        >>> 18 < n < 22
    +        True
    +        """
    +        if relativeSD < 0.000017:
    +            raise ValueError("relativeSD should be greater than 0.000017")
    +        if relativeSD > 0.37:
    +            raise ValueError("relativeSD should be smaller than 0.37")
    +        hashRDD = self.map(lambda x: portable_hash(x) % sys.maxint)
    +        c = hashRDD._to_java_object_rdd().countApproxDistinct(relativeSD)
    +        # range of hash is [0, sys.maxint]
    +        if c > sys.maxint / 30:
    --- End diff --
    
    Just a magic number :-) It could be 10, 20, or 50.
    
    After some experiments, this correction still has a biased error, ranging from -10% to 1%.
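
    For intuition, here is a minimal self-contained sketch (plain Python 2, no Spark; `M` and `k` are illustration values, not the PR's) of the occupancy argument behind this correction: hashing `k` distinct items uniformly into `[0, M)` yields about `M * (1 - e^(-k/M))` distinct hash values, so inverting that relation recovers an estimate of `k`.

    ```python
    import random
    from math import log

    M = 2 ** 20   # stand-in for the hash-space size (sys.maxint in the PR)
    k = 200000    # true number of distinct items

    # Simulate k distinct items hashed uniformly into [0, M); collisions
    # make the observed distinct count c fall short of k.
    c = len(set(random.randrange(M) for _ in xrange(k)))

    # Invert the expected occupancy c = M * (1 - e^(-k/M)) to estimate k;
    # this is the same shape as the correction quoted elsewhere in this
    # thread: c = -sys.maxint * log(1 - float(c) / sys.maxint).
    estimate = -M * log(1 - float(c) / M)
    print c, int(estimate)   # c < k, while estimate should land near k
    ```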




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53783574
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19421/consoleFull) for PR 2142 at commit [`d306492`](https://github.com/apache/spark/commit/d30649283cb20f00dbd1ec0c25bdbc4815cf52d6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53669649
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19372/consoleFull) for PR 2142 at commit [`dfd2a2a`](https://github.com/apache/spark/commit/dfd2a2ad06e7e61bbd18d456c896ea3429714472).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16915312
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -72,7 +72,7 @@ def portable_hash(x):
             for i in x:
                 h ^= portable_hash(i)
                 h *= 1000003
    -            h &= 0xffffffff
    +            h &= sys.maxint
    --- End diff --
    
    On 64-bit machines, the hash of a tuple should be 64 bits (sys.maxint is 64 bits), so masking with 0xffffffff would truncate it to 32 bits.
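
    For reference, a simplified sketch of the tuple branch under discussion (Python 2, where sys.maxint exists; the seed and the final length mix are modeled on CPython's tuple hash and are assumptions, not quoted from the PR):

    ```python
    import sys

    def portable_hash(x):
        # Combine element hashes as in the diff above. Masking with
        # sys.maxint (2**63 - 1 on 64-bit builds) keeps the full native
        # hash width, where the old 0xffffffff mask truncated to 32 bits.
        if isinstance(x, tuple):
            h = 0x345678               # assumed seed, as in CPython
            for i in x:
                h ^= portable_hash(i)
                h *= 1000003
                h &= sys.maxint
            h ^= len(x)                # assumed final mix, as in CPython
            return h
        return hash(x)
    ```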




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53790095
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19418/consoleFull) for PR 2142 at commit [`ded624f`](https://github.com/apache/spark/commit/ded624f11ece4bb53a2b56a3d02697b428f94a8e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16914599
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,46 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    +        >>> 950 < n < 1050
    +        True
    +        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
    +        >>> 18 < n < 22
    +        True
    +        """
    +        if relativeSD < 0.000017:
    +            raise ValueError("relativeSD should be greater than 0.000017")
    +        if relativeSD > 0.37:
    +            raise ValueError("relativeSD should be smaller than 0.37")
    +        hashRDD = self.map(lambda x: portable_hash(x) % sys.maxint)
    +        c = hashRDD._to_java_object_rdd().countApproxDistinct(relativeSD)
    +        # range of hash is [0, sys.maxint]
    +        if c > sys.maxint / 30:
    --- End diff --
    
    Why the 30 here?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53785845
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19422/consoleFull) for PR 2142 at commit [`9d2565f`](https://github.com/apache/spark/commit/9d2565fe6c6d1956e66f830c7e8c3a255998e6b1).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16861865
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,38 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        This support all the types of objects, which is supported by
    +        Pyrolite, nearly all builtin types.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    --- End diff --
    
    I have changed it to do the hash calculation in Python, so it can support all hashable types in Python.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16816578
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,38 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        This support all the types of objects, which is supported by
    +        Pyrolite, nearly all builtin types.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    --- End diff --
    
    Can you add a test to make sure that if you have 1000 non-distinct elements (i.e. the same element appearing 1000 times), this doesn't return ~1000?
    
    Asking because I'm not sure how PySpark interacts with Java: if it is through byte arrays, then the hashcode could be wrong for byte arrays.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53669881
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19373/consoleFull) for PR 2142 at commit [`4cba98f`](https://github.com/apache/spark/commit/4cba98f05f4d3d8373be577946d0f81b44bedb8b).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16819027
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,38 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        This support all the types of objects, which is supported by
    +        Pyrolite, nearly all builtin types.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    --- End diff --
    
    Good point, it helped me find an invalid test case involving tuples: a tuple is unpickled as Object[] in Java, and its hashCode is not determined by its content. I changed the test to use set([]), which should behave the same way in Python and Java.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53672142
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19373/consoleFull) for PR 2142 at commit [`4cba98f`](https://github.com/apache/spark/commit/4cba98f05f4d3d8373be577946d0f81b44bedb8b).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53647569
  
    @mateiz @JoshRosen this one was split out from #1791, please take a look at it. (@mateiz had already reviewed this part in that PR, sorry for the duplicate review.)




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53782111
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19418/consoleFull) for PR 2142 at commit [`ded624f`](https://github.com/apache/spark/commit/ded624f11ece4bb53a2b56a3d02697b428f94a8e).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53792618
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19421/consoleFull) for PR 2142 at commit [`d306492`](https://github.com/apache/spark/commit/d30649283cb20f00dbd1ec0c25bdbc4815cf52d6).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53793682
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19422/consoleFull) for PR 2142 at commit [`9d2565f`](https://github.com/apache/spark/commit/9d2565fe6c6d1956e66f830c7e8c3a255998e6b1).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16914397
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -72,7 +72,7 @@ def portable_hash(x):
             for i in x:
                 h ^= portable_hash(i)
                 h *= 1000003
    -            h &= 0xffffffff
    +            h &= sys.maxint
    --- End diff --
    
    Why this change?




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-54229591
  
    LGTM, so I've merged this into `master`.  Thanks!




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16915424
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,46 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    +        >>> 950 < n < 1050
    +        True
    +        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
    +        >>> 18 < n < 22
    +        True
    +        """
    +        if relativeSD < 0.000017:
    +            raise ValueError("relativeSD should be greater than 0.000017")
    +        if relativeSD > 0.37:
    +            raise ValueError("relativeSD should be smaller than 0.37")
    +        hashRDD = self.map(lambda x: portable_hash(x) % sys.maxint)
    --- End diff --
    
    hash() in Python can be negative, but portable_hash() of a tuple is always positive, so we should make them share the same range.
    
    hash % sys.maxint is better than abs(hash): for values in a small range [-N, N], it will not produce hash collisions.
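
    A quick illustration of that point (Python 2):

    ```python
    import sys

    # abs() folds x and -x onto the same value, so small positive and
    # negative hashes collide pairwise; modulo instead wraps negatives to
    # the top of [0, sys.maxint), keeping them distinct.
    x = 42
    print abs(x) == abs(-x)                  # True: a guaranteed collision
    print x % sys.maxint, -x % sys.maxint    # 42 and sys.maxint - 42
    ```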




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2142




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53813878
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19428/consoleFull) for PR 2142 at commit [`c38c4e4`](https://github.com/apache/spark/commit/c38c4e474849e3c2a13b5f508ee217ce63dd56bf).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53805503
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19428/consoleFull) for PR 2142 at commit [`c38c4e4`](https://github.com/apache/spark/commit/c38c4e474849e3c2a13b5f508ee217ce63dd56bf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53935085
  
    To check my understanding of the error-correction code:
    
    - Due to hash collisions, we may underestimate the true number of distinct items.
    - Given a set of `k` random 32-bit hashes, the exact probability of _at least one_ collision ([from your link](http://preshing.com/20110504/hash-collision-probabilities/)) is _1 - e^(-k(k-1)/(2 * 2^32))_. We can approximate _1 - e^(-X)_ by _X_ for small _X_. Therefore, the approximate probability of some hash collision is _k(k-1)/(2 * 2^32)_, which is roughly _k^2/2^33_ (checked numerically below).
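
    A quick numeric check of that approximation (plain Python 2, illustration values only):

    ```python
    from math import exp

    # Exact birthday bound vs. the k^2 / 2^33 approximation for the
    # probability of at least one collision among k random 32-bit hashes.
    for k in (10 ** 4, 10 ** 5, 10 ** 6):
        exact = 1 - exp(-float(k) * (k - 1) / (2 * 2 ** 32))
        approx = float(k) ** 2 / 2 ** 33
        print k, exact, approx   # close while the probability stays small
    ```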
    
    I'm a bit confused about how the current correction term works:
    
    ```python
    c = - sys.maxint * log(1 - float(c) / sys.maxint)
    ```
    
    It looks like this is correcting for _overestimates_ of the number of distinct elements by subtracting a term based on the collision probability.  In general, won't collisions cause us to underestimate rather than overestimate?
    
    Maybe we should approach this by treating the true number of distinct items (_k_) as a random variable and figuring out the maximum likelihood estimator of _k_ given an observation _c_ of the result of `countApproxDistinct`.
    
    Before we consider that, though, I wonder whether we even need a correction term.  Doesn't the Java implementation of `countApproxDistinct` already introduce hashing errors and correct for them internally?  I don't think that the two levels of hashing will introduce more error, since I think the hashcode of a Java integer should just be its value.
    
    _Note_: I'm not a statistician; please correct me if I've gotten anything wrong.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-54211600
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19601/consoleFull) for PR 2142 at commit [`e20da47`](https://github.com/apache/spark/commit/e20da472979ccb4b12fe8d7d7a059fde9af4ef60).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53803655
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19423/consoleFull) for PR 2142 at commit [`2ab157c`](https://github.com/apache/spark/commit/2ab157c567fe9f45394160377ac7f26c05ac2f0b).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-54203133
  
    After discussing with @JoshRosen offline, we realized that no correction is needed in Python as long as Python and Java use the same hash space, so I changed the hash space (mapping into 2^32) and removed the correction.
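
    A plausible sketch of the revised method (reconstructed from this description and the diffs above, not quoted from the final commit):

    ```python
    def countApproxDistinct(self, relativeSD=0.05):
        if relativeSD < 0.000017:
            raise ValueError("relativeSD should be greater than 0.000017")
        # The hash space in Java is 2^32 and Integer.hashCode() is the
        # integer's own value, so mapping portable_hash() into 32 bits in
        # Python means the JVM-side HyperLogLog sees exactly these values
        # and no Python-side correction is needed.
        hashRDD = self.map(lambda x: portable_hash(x) & 0xFFFFFFFF)
        return hashRDD._to_java_object_rdd().countApproxDistinct(relativeSD)
    ```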
    
    Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-54203358
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19601/consoleFull) for PR 2142 at commit [`e20da47`](https://github.com/apache/spark/commit/e20da472979ccb4b12fe8d7d7a059fde9af4ef60).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16914423
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -993,7 +993,7 @@ abstract class RDD[T: ClassTag](
        */
       @Experimental
       def countApproxDistinct(p: Int, sp: Int): Long = {
    -    require(p >= 4, s"p ($p) must be greater than 0")
    +    require(p >= 4, s"p ($p) must be at least 4")
    --- End diff --
    
    Good catch!




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53672163
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19372/consoleFull) for PR 2142 at commit [`dfd2a2a`](https://github.com/apache/spark/commit/dfd2a2ad06e7e61bbd18d456c896ea3429714472).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-54228716
  
    Yeah, I don't think we need any special correction, because Java will use the hash codes chosen in Python (since Integer.hashCode is just the integer's value).




[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2142#discussion_r16914486
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1993,11 +1993,46 @@ def meanApprox(self, timeout, confidence=0.95):
             >>> (rdd.meanApprox(1000) - r) / r < 0.05
             True
             """
    -        jrdd = self.map(float)._to_jrdd()
    +        jrdd = self.map(float)._to_java_object_rdd()
             jdrdd = self.ctx._jvm.JavaDoubleRDD.fromRDD(jrdd.rdd())
             r = jdrdd.meanApprox(timeout, confidence).getFinalValue()
             return BoundedFloat(r.mean(), r.confidence(), r.low(), r.high())
     
    +    def countApproxDistinct(self, relativeSD=0.05):
    +        """
    +        :: Experimental ::
    +        Return approximate number of distinct elements in the RDD.
    +
    +        The algorithm used is based on streamlib's implementation of
    +        "HyperLogLog in Practice: Algorithmic Engineering of a State
    +        of The Art Cardinality Estimation Algorithm", available
    +        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
    +
    +        @param relativeSD Relative accuracy. Smaller values create
    +                           counters that require more space.
    +                           It must be greater than 0.000017.
    +
    +        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    +        >>> 950 < n < 1050
    +        True
    +        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
    +        >>> 18 < n < 22
    +        True
    +        """
    +        if relativeSD < 0.000017:
    +            raise ValueError("relativeSD should be greater than 0.000017")
    +        if relativeSD > 0.37:
    +            raise ValueError("relativeSD should be smaller than 0.37")
    +        hashRDD = self.map(lambda x: portable_hash(x) % sys.maxint)
    --- End diff --
    
    I guess this is because portable_hash() might return a long but you want to make sure it fits in an int?

