Posted to reviews@spark.apache.org by freeman-lab <gi...@git.apache.org> on 2014/08/02 01:09:01 UTC

[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

GitHub user freeman-lab opened a pull request:

    https://github.com/apache/spark/pull/1725

    StatCounter on NumPy arrays [PYSPARK][SPARK-2012]

    These changes allow StatCounters to work properly on NumPy arrays, fixing the issue reported in SPARK-2012 (https://issues.apache.org/jira/browse/SPARK-2012).
    
    If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which operate element-wise on arrays, are used to merge statistics. Otherwise we fall back on the scalar operators, so StatCounter handles arrays when NumPy is available and still works on scalars without it.
    
    New unit tests added, along with a check for NumPy in the tests.
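The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class name `MiniStatCounter` and its attribute names are hypothetical, and the running update is a standard Welford-style formula chosen for the example.

```python
import math

# Prefer NumPy's element-wise functions; fall back to the scalar built-ins.
try:
    from numpy import maximum, minimum, sqrt
except ImportError:
    maximum, minimum, sqrt = max, min, math.sqrt


class MiniStatCounter(object):
    """Hypothetical, simplified counter; not pyspark.statcounter.StatCounter."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.max_value = float("-inf")
        self.min_value = float("inf")

    def merge(self, value):
        # Welford-style running update; each operation is element-wise
        # when `value` is a NumPy array.
        delta = value - self.mu
        self.n += 1
        self.mu = self.mu + delta / self.n
        self.m2 = self.m2 + delta * (value - self.mu)
        self.max_value = maximum(self.max_value, value)
        self.min_value = minimum(self.min_value, value)
        return self

    def stdev(self):
        # Population standard deviation of the merged values.
        return sqrt(self.m2 / self.n)


counter = MiniStatCounter()
for x in [1.0, 2.0, 3.0]:
    counter.merge(x)
```

With NumPy installed, merging arrays of equal shape instead of scalars should track the mean, extrema, and deviation of each position separately, because every operation in `merge` then broadcasts element-wise.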

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/freeman-lab/spark numpy-max-statcounter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1725
    
----
commit 176a127c3c35512a2690ad8ccfb020ea94e42596
Author: Jeremy Freeman <th...@gmail.com>
Date:   2014-08-01T22:47:50Z

    Use numpy arrays in StatCounter
    
    - If NumPy is installed, use maximum/minimum/sqrt so that StatCounters
    work on NumPy arrays
    - Otherwise, fall back on scalar operators

commit 1c8a832ac71dafad893b3f92d12d57c284496402
Author: Jeremy Freeman <th...@gmail.com>
Date:   2014-08-01T22:48:04Z

    Unit tests for StatCounter with NumPy arrays

commit 875414c6d79ef8e8a8938cf888eba71a9bdad070
Author: Jeremy Freeman <th...@gmail.com>
Date:   2014-08-01T23:04:16Z

    Fixed indents

commit 8e764dd0e77e1c32827859fe09019c9c912defb1
Author: Jeremy Freeman <th...@gmail.com>
Date:   2014-08-01T23:07:31Z

    Explicit numpy imports

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1725#discussion_r15724547
  
    --- Diff: python/pyspark/statcounter.py ---
    @@ -20,6 +20,14 @@
     import copy
     import math
     
    +_have_numpy = False
    +try:
    +    from numpy import maximum, minimum, sqrt
    +    _have_numpy = True
    +except:
    --- End diff --
    
    It's better to catch ImportError here.
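To illustrate why the narrower clause matters, here is a small sketch using only the standard library. The function names are hypothetical, and `undefined_name` stands in for any unrelated bug that happens to sit inside the `try` body.

```python
def load_sqrt_bare():
    try:
        from math import sqrt
        undefined_name  # stand-in for an unrelated bug inside the try body
        return sqrt
    except:  # catches the NameError too, so the bug looks like a missing module
        return None


def load_sqrt_narrow():
    try:
        from math import sqrt
        undefined_name  # same bug as above
        return sqrt
    except ImportError:  # only a failed import is treated as "not installed"
        return None
```

Calling `load_sqrt_bare()` quietly returns `None`, while `load_sqrt_narrow()` lets the `NameError` propagate, so the real problem is visible instead of being mistaken for an absent dependency.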



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50954666
  
    @JoshRosen @davies great, thanks guys!



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1725#discussion_r15725760
  
    --- Diff: python/pyspark/statcounter.py ---
    @@ -20,6 +20,14 @@
     import copy
     import math
     
    +_have_numpy = False
    +try:
    +    from numpy import maximum, minimum, sqrt
    +    _have_numpy = True
    +except:
    --- End diff --
    
    Nice! This is much better, updating the PR now...



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1725#discussion_r15724621
  
    --- Diff: python/pyspark/statcounter.py ---
    @@ -20,6 +20,14 @@
     import copy
     import math
     
    +_have_numpy = False
    +try:
    +    from numpy import maximum, minimum, sqrt
    +    _have_numpy = True
    +except:
    --- End diff --
    
    How about doing it this way:
    
     try:
         from numpy import maximum, minimum, sqrt
     except ImportError:
         maximum = max
         minimum = min
         sqrt = math.sqrt
    
    This will simplify the later code.



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50944760
  
    QA tests have started for PR 1725. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1725#discussion_r15726945
  
    --- Diff: python/pyspark/tests.py ---
    @@ -38,12 +38,19 @@
     from pyspark.shuffle import Aggregator, InMemoryMerger, ExternalMerger
     
     _have_scipy = False
    +_have_numpy = False
     try:
         import scipy.sparse
         _have_scipy = True
     except:
         # No SciPy, but that's okay, we'll skip those tests
         pass
    +try:
    +    from numpy import array
    --- End diff --
    
    Just import numpy itself; this `array` will shadow array.array and make other unit tests fail.
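A short sketch of the suggested fix, assuming the test module also relies on the standard-library `array` class under its bare name. The variable names here are hypothetical.

```python
from array import array  # standard-library typed array

try:
    import numpy as np  # qualified import leaves the name `array` untouched
    _have_numpy = True
except ImportError:
    np = None
    _have_numpy = False

# `array` still refers to the standard-library class.
stdlib_values = array("d", [1.0, 2.0])

if _have_numpy:
    # NumPy arrays are reached through the module name instead.
    numpy_values = np.array([1.0, 2.0])
```

Had the import been `from numpy import array`, the second binding would replace the first, and a later call such as `array("d", [1.0, 2.0])` would no longer construct a standard-library array.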



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50947072
  
    Thanks for contributing this patch, it would be great to merge it into the 1.1 release.
    
    PS: the code freeze will happen tonight :)



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50952267
  
    QA results for PR 1725:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1725




[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50947872
  
    QA results for PR 1725:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17714/consoleFull



[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50954571
  
    This looks good.  At first, I was concerned that element-wise operations might change behavior for calling `stats()` on an RDD of Python lists of numbers (`sc.parallelize([[1, 0, 1], [4, -1, 4]]).stats()`), but that currently crashes in Spark 1.0, so this patch won't change users' results.
    
    ```python
    >>> from numpy import maximum
    >>> maximum([1, 0, 1], [4, -1, 4])
    array([4, 0, 4])
    >>> max([1, 0, 1], [4, -1, 4])
    [4, -1, 4]
    ```
    
    I ran the PySpark tests locally and they passed, so I've merged this.  Thanks Jeremy!




[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by freeman-lab <gi...@git.apache.org>.
Github user freeman-lab commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1725#discussion_r15727298
  
    --- Diff: python/pyspark/tests.py ---
    @@ -38,12 +38,19 @@
     from pyspark.shuffle import Aggregator, InMemoryMerger, ExternalMerger
     
     _have_scipy = False
    +_have_numpy = False
     try:
         import scipy.sparse
         _have_scipy = True
     except:
         # No SciPy, but that's okay, we'll skip those tests
         pass
    +try:
    +    from numpy import array
    --- End diff --
    
    @davies thanks, good catch, should be fixed now!




[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50954198
  
    lgtm
    
    @JoshRosen Could you take a look at this?




[GitHub] spark pull request: StatCounter on NumPy arrays [PYSPARK][SPARK-20...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1725#issuecomment-50951033
  
    QA tests have started for PR 1725. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17733/consoleFull

