You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/08/22 06:35:47 UTC

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2091

    [SPARK-2871] [PySpark] add histgram() API

            Compute a histogram using the provided buckets. The buckets
            are all open to the right except for the last which is closed.
            e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
            which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
            and 50 we would have a histogram of 1,0,1.
    
            If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
            this can be switched from an O(log n) inseration to O(1) per
            element(where n = # buckets), if you set `even` to True.
    
            Buckets must be sorted and not contain any duplicates, must be
            at least two elements.
    
            If `buckets` is a number, it will generates buckets which is
            evenly spaced between the minimum and maximum of the RDD. For
            example, if the min value is 0 and the max is 100, given buckets
            as 2, the resulting buckets will be [0,50) [50,100]. buckets must
            be at least 1 If the RDD contains infinity, NaN throws an exception
            If the elements in RDD do not vary (max == min) always returns
            a single bucket.
    
            It will return an tuple of buckets and histogram.
    
            >>> rdd = sc.parallelize(range(51))
            >>> rdd.histogram(2)
            ([0, 25, 50], [25, 26])
            >>> rdd.histogram([0, 5, 25, 50])
            ([0, 5, 25, 50], [5, 20, 26])
            >>> rdd.histogram([0, 15, 30, 45, 60], True)
            ([0, 15, 30, 45, 60], [15, 15, 15, 6])


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark histgram

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2091.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2091
    
----
commit 0e18a2d3e98ca940320153b192fa03306cfd8f9a
Author: Davies Liu <da...@gmail.com>
Date:   2014-08-22T04:33:38Z

    add histgram() API

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53355210
  
    @JoshRosen sure doing a linear scan works, the evenBuckets was because the caller  knows if its providing even buckets.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53375345
  
    When you guys merge this, please close https://github.com/apache/spark/pull/122 as well. You should just edit the pull request description to also say "closes #122", and then it will close it when it merges. Let the person there know that this is now implemented as well, since that was a separate implementation. 
    
    BTW regarding evenBuckets -- I think it's more or less an accident that we exposed it in Scala, so I'd leave it out here. One awkward thing is that the Scala one is probably not even precise; it's not super clear that the floating-point math will work out such that each bucket's start / end is exactly the same result you get when doing the O(1) bucket computation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53343907
  
    @JoshRosen I had removed evenBuckets, also added more tests, and some test cases for `str` type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53481077
  
    LGTM.  Merged into `master` and `branch-1.1` (since it only adds new methods and doesn't modify any existing code).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53145884
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19098/consoleFull) for   PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53024894
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19071/consoleFull) for   PR 2091 at commit [`0e18a2d`](https://github.com/apache/spark/commit/0e18a2d3e98ca940320153b192fa03306cfd8f9a).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53022070
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19071/consoleFull) for   PR 2091 at commit [`0e18a2d`](https://github.com/apache/spark/commit/0e18a2d3e98ca940320153b192fa03306cfd8f9a).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53378149
  
    done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2091


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53180978
  
    @mateiz @JoshRosen I would like to change `evenBuckets` to `even`,  the later one is meaningful enough and much shorter. 
    
    One concern is that we will have different names in Scala/Java and Python, for same parameter. These languages are much different, Java does not have named parameters, change the name does not break compatibility, Scala do have named parameters, but in this case, we have three API for histgram(), it's hardly that user will call it like this:
    
        rdd.histgram(buckets, evenBuckets=True)
                
    In Python, the `buckets` can be int, it's already different than that in Scala/Java, so having `evenBuckets` as `even` also will not be a surprise here.
    
    Here's my 2 cents, how do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53354293
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19165/consoleFull) for   PR 2091 at commit [`a322f8a`](https://github.com/apache/spark/commit/a322f8a98f3aab60eb4e6dec9eae08c3b6f5e97b).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53143433
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19094/consoleFull) for   PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mattf <gi...@git.apache.org>.

Github user mattf commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16630390
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    +        evenly spaced between the minimum and maximum of the RDD. For
    +        example, if the min value is 0 and the max is 100, given buckets
    +        as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    +        be at least 1 If the RDD contains infinity, NaN throws an exception
    +        If the elements in RDD do not vary (max == min) always returns
    +        a single bucket.
    +
    +        It will return an tuple of buckets and histogram.
    +
    +        >>> rdd = sc.parallelize(range(51))
    +        >>> rdd.histogram(2)
    +        ([0, 25, 50], [25, 26])
    +        >>> rdd.histogram([0, 5, 25, 50])
    +        ([0, 5, 25, 50], [5, 20, 26])
    +        >>> rdd.histogram([0, 15, 30, 45, 60], True)
    +        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    +        """
    +
    +        if isinstance(buckets, (int, long)):
    +            if buckets < 1:
    +                raise ValueError("buckets should not less than 1")
    +
    +            # filter out non-comparable elements
    +            self = self.filter(lambda x: x is not None and not isnan(x))
    +
    +            # faster than stats()
    +            def minmax(a, b):
    +                return min(a[0], b[0]), max(a[1], b[1])
    +            try:
    +                minv, maxv = self.map(lambda x: (x, x)).reduce(minmax)
    +            except TypeError as e:
    +                if e.message == "reduce() of empty sequence with no initial value":
    --- End diff --
    
    the goal of propagating messages that do not expose implementation details is good imho
    
    are you confident that the "empty sequence" is the only exception that could arise?
    
    i was thinking about a mixed types in the rdd, but maybe that's not a problem


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mattf <gi...@git.apache.org>.

Github user mattf commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16627753
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    --- End diff --
    
    nit - buckets which are


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53145718
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mattf <gi...@git.apache.org>.

Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53416722
  
    lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53172796
  
    These are _excellent_ unit tests.  I ran a coverage report with [coverage.py](http://nedbatchelder.com/code/coverage/) and it reports essentially 100% coverage for the driver-side histogram code in rdd.py.
    
    ```bash
    export PYTHONPATH=$SPARK_HOME/python
    coverage run python/pyspark/tests.py TestRDDFunctions.test_histogram
    coverage html
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53350206
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19159/consoleFull) for   PR 2091 at commit [`84e85fa`](https://github.com/apache/spark/commit/84e85fa3f4358f608e6a362b3931d5a11ec2755f).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53343906
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19159/consoleFull) for   PR 2091 at commit [`84e85fa`](https://github.com/apache/spark/commit/84e85fa3f4358f608e6a362b3931d5a11ec2755f).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16628887
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    +        evenly spaced between the minimum and maximum of the RDD. For
    +        example, if the min value is 0 and the max is 100, given buckets
    +        as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    +        be at least 1 If the RDD contains infinity, NaN throws an exception
    +        If the elements in RDD do not vary (max == min) always returns
    +        a single bucket.
    +
    +        It will return an tuple of buckets and histogram.
    +
    +        >>> rdd = sc.parallelize(range(51))
    +        >>> rdd.histogram(2)
    +        ([0, 25, 50], [25, 26])
    +        >>> rdd.histogram([0, 5, 25, 50])
    +        ([0, 5, 25, 50], [5, 20, 26])
    +        >>> rdd.histogram([0, 15, 30, 45, 60], True)
    +        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    +        """
    +
    +        if isinstance(buckets, (int, long)):
    +            if buckets < 1:
    +                raise ValueError("buckets should not less than 1")
    +
    +            # filter out non-comparable elements
    +            self = self.filter(lambda x: x is not None and not isnan(x))
    +
    +            # faster than stats()
    +            def minmax(a, b):
    +                return min(a[0], b[0]), max(a[1], b[1])
    +            try:
    +                minv, maxv = self.map(lambda x: (x, x)).reduce(minmax)
    +            except TypeError as e:
    +                if e.message == "reduce() of empty sequence with no initial value":
    --- End diff --
    
    Yes, it's fragile. If we propagate the message from reduce(), user may treat this is  a bug of PySpark, because use did not call reduce() explicitly.
    
    Or we define a special exception for reduce() ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mattf <gi...@git.apache.org>.

Github user mattf commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16627566
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    +        evenly spaced between the minimum and maximum of the RDD. For
    +        example, if the min value is 0 and the max is 100, given buckets
    +        as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    +        be at least 1 If the RDD contains infinity, NaN throws an exception
    +        If the elements in RDD do not vary (max == min) always returns
    +        a single bucket.
    +
    +        It will return an tuple of buckets and histogram.
    +
    +        >>> rdd = sc.parallelize(range(51))
    +        >>> rdd.histogram(2)
    +        ([0, 25, 50], [25, 26])
    +        >>> rdd.histogram([0, 5, 25, 50])
    +        ([0, 5, 25, 50], [5, 20, 26])
    +        >>> rdd.histogram([0, 15, 30, 45, 60], True)
    +        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    +        """
    +
    +        if isinstance(buckets, (int, long)):
    +            if buckets < 1:
    +                raise ValueError("buckets should not less than 1")
    --- End diff --
    
    how about "number of buckets must be >= 1" ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53359625
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19165/consoleFull) for   PR 2091 at commit [`a322f8a`](https://github.com/apache/spark/commit/a322f8a98f3aab60eb4e6dec9eae08c3b6f5e97b).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53147036
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19098/consoleFull) for   PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16633581
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    +        evenly spaced between the minimum and maximum of the RDD. For
    +        example, if the min value is 0 and the max is 100, given buckets
    +        as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    +        be at least 1 If the RDD contains infinity, NaN throws an exception
    +        If the elements in RDD do not vary (max == min) always returns
    +        a single bucket.
    +
    +        It will return an tuple of buckets and histogram.
    +
    +        >>> rdd = sc.parallelize(range(51))
    +        >>> rdd.histogram(2)
    +        ([0, 25, 50], [25, 26])
    +        >>> rdd.histogram([0, 5, 25, 50])
    +        ([0, 5, 25, 50], [5, 20, 26])
    +        >>> rdd.histogram([0, 15, 30, 45, 60], True)
    +        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    +        """
    +
    +        if isinstance(buckets, (int, long)):
    +            if buckets < 1:
    +                raise ValueError("buckets should not less than 1")
    +
    +            # filter out non-comparable elements
    +            self = self.filter(lambda x: x is not None and not isnan(x))
    +
    +            # faster than stats()
    +            def minmax(a, b):
    +                return min(a[0], b[0]), max(a[1], b[1])
    +            try:
    +                minv, maxv = self.map(lambda x: (x, x)).reduce(minmax)
    +            except TypeError as e:
    +                if e.message == "reduce() of empty sequence with no initial value":
    --- End diff --
    
    Although this is fragile, nothing bad will happen if this breaks because we'll end up re-raising the exception on line 908.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53144107
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19094/consoleFull) for   PR 2091 at commit [`d9a0722`](https://github.com/apache/spark/commit/d9a07225003148805505a703082f34ad397b974e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2091#issuecomment-53305144
  
    Why do we even need `evenBuckets`?  Can't we just check whether the buckets are evenly-spaced and automatically perform the optimization if they are?  This only requires a linear scan through the buckets array.  I suppose that we could run into problems if the buckets are evenly-spaced but small rounding / precision errors cause us to mark the buckets as non-even.
    
    /cc @holdenk who implemented the `DoublePairRDDFunctions.histogram()` function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2871] [PySpark] add histgram() API

Posted by mattf <gi...@git.apache.org>.

Github user mattf commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2091#discussion_r16627673
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -856,6 +856,104 @@ def redFunc(left_counter, right_counter):
     
             return self.mapPartitions(lambda i: [StatCounter(i)]).reduce(redFunc)
     
    +    def histogram(self, buckets, evenBuckets=False):
    +        """
    +        Compute a histogram using the provided buckets. The buckets
    +        are all open to the right except for the last which is closed.
    +        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    +        which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
    +        and 50 we would have a histogram of 1,0,1.
    +
    +        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    +        this can be switched from an O(log n) inseration to O(1) per
    +        element(where n = # buckets), if you set `even` to True.
    +
    +        Buckets must be sorted and not contain any duplicates, must be
    +        at least two elements.
    +
    +        If `buckets` is a number, it will generates buckets which is
    +        evenly spaced between the minimum and maximum of the RDD. For
    +        example, if the min value is 0 and the max is 100, given buckets
    +        as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    +        be at least 1 If the RDD contains infinity, NaN throws an exception
    +        If the elements in RDD do not vary (max == min) always returns
    +        a single bucket.
    +
    +        It will return an tuple of buckets and histogram.
    +
    +        >>> rdd = sc.parallelize(range(51))
    +        >>> rdd.histogram(2)
    +        ([0, 25, 50], [25, 26])
    +        >>> rdd.histogram([0, 5, 25, 50])
    +        ([0, 5, 25, 50], [5, 20, 26])
    +        >>> rdd.histogram([0, 15, 30, 45, 60], True)
    +        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    +        """
    +
    +        if isinstance(buckets, (int, long)):
    +            if buckets < 1:
    +                raise ValueError("buckets should not less than 1")
    +
    +            # filter out non-comparable elements
    +            self = self.filter(lambda x: x is not None and not isnan(x))
    +
    +            # faster than stats()
    +            def minmax(a, b):
    +                return min(a[0], b[0]), max(a[1], b[1])
    +            try:
    +                minv, maxv = self.map(lambda x: (x, x)).reduce(minmax)
    +            except TypeError as e:
    +                if e.message == "reduce() of empty sequence with no initial value":
    --- End diff --
    
    checking the message is fragile, it can change and break this code
    
    how about letting all TypeErrors propagate?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org