Posted to reviews@spark.apache.org by 0x0FFF <gi...@git.apache.org> on 2015/09/08 23:36:08 UTC

[GitHub] spark pull request: [SPARK-9014][SQL] Allow Python spark API to us...

GitHub user 0x0FFF opened a pull request:

    https://github.com/apache/spark/pull/8658

    [SPARK-9014][SQL] Allow Python spark API to use built-in exponential operator

    This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014).
    Added functionality: `Column` object in Python now supports exponential operator `**`
    Example:
    ```
    from pyspark.sql import *
    df = sqlContext.createDataFrame([Row(a=2)])
    df.select(3 ** df.a, df.a ** 3, df.a ** df.a).collect()
    ```
    Outputs:
    ```
    [Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
    ```
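As a rough sketch of the operator mechanics this PR relies on (plain Python, no Spark involved; `Expr` is a hypothetical stand-in for pyspark's `Column`, building the same `POWER(...)` strings as the output above instead of calling into the JVM):

```python
# Minimal sketch of routing `**` to a named SQL function.
# `Expr` is a hypothetical stand-in for pyspark's Column: it records the
# expression as a string rather than delegating to the JVM.
class Expr:
    def __init__(self, name):
        self.name = name

    def __pow__(self, other):      # handles expr ** other
        rhs = other.name if isinstance(other, Expr) else repr(float(other))
        return Expr("POWER(%s, %s)" % (self.name, rhs))

    def __rpow__(self, other):     # handles other ** expr (reflected form)
        return Expr("POWER(%s, %s)" % (repr(float(other)), self.name))

a = Expr("a")
print((3 ** a).name)   # POWER(3.0, a)
print((a ** 3).name)   # POWER(a, 3.0)
print((a ** a).name)   # POWER(a, a)
```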

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/0x0FFF/spark SPARK-9014

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8658.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8658
    
----
commit 485e9ba1d1dc2df81130a8bd883105646247d449
Author: 0x0FFF <pr...@gmail.com>
Date:   2015-09-08T21:27:28Z

    [SPARK-9014][SQL] Allow Python spark API to use built-in exponential operator

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9014][SQL] Allow Python spark API to us...

Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139131882
  
    @davies, could you take a look, please?



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39186863
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    We already have `pow` in pyspark.sql.functions.
    
    For here, it's easy to do like this:
    ```
    from pyspark.sql.functions import pow
    __pow__ = pow
    __rpow__ = lambda c, other: pow(other, c)
    ```
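A plain-Python sketch of why this assignment style works: a function stored on a class becomes a bound method, so `__pow__ = f` ends up calling `f(self, other)`, and the reversed case just swaps the arguments (`spark_pow` here is a hypothetical stand-in for `pyspark.sql.functions.pow`):

```python
# Stand-in for pyspark.sql.functions.pow: a free function of two arguments.
def spark_pow(x, y):
    return "POWER(%s, %s)" % (x, y)

class Col:
    # A plain function stored on the class is looked up as a method,
    # so `col ** other` becomes spark_pow(col, other).
    __pow__ = spark_pow
    # The reflected operator just swaps the argument order.
    __rpow__ = lambda c, other: spark_pow(other, c)

    def __str__(self):
        return "col"

c = Col()
print(c ** 2)    # POWER(col, 2)
print(2 ** c)    # POWER(2, col)
```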




Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-138709945
  
    Can one of the admins verify this patch?



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139671003
  
      [Test build #1745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1745/console) for PR 8658 at commit [`aecc0c2`](https://github.com/apache/spark/commit/aecc0c2fb2e4fddcec97e4cca9eac243df848acb).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r38985416
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -91,6 +91,17 @@ def _(self):
         return _
     
     
    +def _bin_func_op(name, reverse=False, doc="binary function"):
    --- End diff --
    
    If we are _just_ looking at adding pow then it might make more sense to add pow to column, but if we are looking at making it easy to access the rest of the functions in functions.scala this sounds good (although maybe add a tiny comment explaining the purpose).
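A hedged sketch of the factory pattern under discussion, in plain Python: `FUNCS` stands in for the JVM-side `org.apache.spark.sql.functions` registry and `Col` for pyspark's `Column` — both hypothetical here:

```python
# Sketch of a generic _bin_func_op-style factory: one helper that can expose
# any binary function from a registry as a Python operator.
# FUNCS is a hypothetical stand-in for org.apache.spark.sql.functions.
FUNCS = {"pow": lambda x, y: "POWER(%s, %s)" % (x, y)}

class Col:
    def __init__(self, expr):
        self.expr = expr

def _bin_func_op(name, reverse=False):
    def _(self, other):
        o = other.expr if isinstance(other, Col) else repr(float(other))
        args = (o, self.expr) if reverse else (self.expr, o)
        return Col(FUNCS[name](*args))
    return _

Col.__pow__ = _bin_func_op("pow")
Col.__rpow__ = _bin_func_op("pow", reverse=True)

print((Col("a") ** 2).expr)   # POWER(a, 2.0)
print((2 ** Col("a")).expr)   # POWER(2.0, a)
```

Because the function name is a parameter, the same helper could later expose other registry entries without new plumbing — which is exactly the trade-off being weighed in this thread.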



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139476639
  
    Agreed; with commit aecc0c2 I reverted to the first option and replaced `float` with `_create_column_from_literal`, as proposed.



Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139667259
  
    LGTM, waiting for tests.



Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139667413
  
      [Test build #1745 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1745/consoleFull) for PR 8658 at commit [`aecc0c2`](https://github.com/apache/spark/commit/aecc0c2fb2e4fddcec97e4cca9eac243df848acb).



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-138851028
  
    @holdenk, there are two ways of implementing this:
    
    1. The one I've done: add a helper to `column.py` that exposes binary functions from `org.apache.spark.sql.functions`, and use its `pow()` for `__pow__` and `__rpow__`. Pros: the same helper can be used to expose other functions from `org.apache.spark.sql.functions` in `column.py`. Cons: the Column APIs become inconsistent between Python and Scala (to be fair, Scala does not have `**` out of the box, so it is debatable whether this is really an inconsistency).
    2. Add implementations of `**` and `pow` to `org.apache.spark.sql.Column` and use them in `column.py` via the existing `_bin_op` and `_reverse_op`. Pros: the `**` operation is added to the Scala `org.apache.spark.sql.Column` as well, and it means slightly less new code. Cons: this covers only the `pow()` implementation; if in the future you need another function from `org.apache.spark.sql.functions` in Python, you would have to change `org.apache.spark.sql.Column` first once again. Also, in Scala the case `3 ** col("a")` will not work, because Scala would look for a `**` implementation on `Int`.
    
    In my opinion both options are viable, and the second one might even be a bit better. I can easily switch to the second option; I've already implemented it locally. What is your view?
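The Scala limitation in option 2's cons comes from Python having a reflected-operator protocol that Scala lacks; a minimal illustration (plain Python, with `Col` a hypothetical stand-in for `Column`):

```python
# Python evaluates `3 ** col` by first trying int.__pow__(3, col); since int
# doesn't recognize the type it returns NotImplemented, and Python falls back
# to col.__rpow__(3). Scala has no such fallback, so 3 ** col("a") would need
# a `**` method on Int itself.
class Col:
    def __rpow__(self, base):
        return "POWER(%s, col)" % repr(float(base))

print(3 ** Col())   # POWER(3.0, col)
```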



Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8658



Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r38985409
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    So this doesn't quite match the Scala API (which I suppose isn't the end of the world), but would it possibly make sense to have a similar functions.py file to match the Scala API?



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139384669
  
    Please also check out the implementation from the last commit. In my opinion it is much more consistent. I just cannot implement `_pow` in `column.py`; it looks too much like duct tape. A general `_bin_func_op` function that converts the second parameter to `Column` with a `lit` call is better, but in my opinion the implementation from commit 40cbcb4 is still much better: the code is easier to read, more consistent, and it adds less code (only 4 lines, excluding comment and tests).



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39189353
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    @davies, the `pow` function in `pyspark.sql.functions` is created with `_create_binary_mathfunction`, which uses `Column` internally, so it cannot simply be imported from `pyspark.sql.functions` here.



Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139405282
  
    +1 on not having this for Scala. There is already a `pow` function that does pow(x, y).
    
    We should just do this for Python.




Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139386349
  
    cc @rxin 



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39193022
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    That makes sense, but `_bin_func_op` expects `other` to be a `Column` or a float, so it's not in a shape that we could easily reuse for other functions.



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39190746
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    I see, I'd like to go with the current approach.
    
    We could rename `_bin_func_op` to `_pow` and use it for both `__pow__` and `__rpow__`; that would be clearer.



Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139386251
  
    @0x0FFF I think `**` only makes sense in Python, so we should not introduce `**` into Scala (nor `pow`).



Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8658#issuecomment-139316438
  
    Jenkins, OK to test.



Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39212928
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    This sounds better, go for it, thanks!



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39192640
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    I can rename it, but what if we later implement something like `__lshift__` or `__rshift__` to allow the syntax `df.a << 2`? Then we would have to either rename `_pow` back to `_bin_func_op` and reuse it, or add one more function. As it is now, it's clear that `_bin_func_op` lets you use any binary function. What do you think?



Posted by 0x0FFF <gi...@git.apache.org>.
Github user 0x0FFF commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8658#discussion_r39195144
  
    --- Diff: python/pyspark/sql/column.py ---
    @@ -151,6 +162,8 @@ def __init__(self, jc):
         __rdiv__ = _reverse_op("divide")
         __rtruediv__ = _reverse_op("divide")
         __rmod__ = _reverse_op("mod")
    +    __pow__ = _bin_func_op("pow")
    --- End diff --
    
    Agree. What if I replace
    ```
    jc = other._jc if isinstance(other, Column) else float(other)
    ```
    with
    ```
    jc = other._jc if isinstance(other, Column) else _create_column_from_literal(other)
    ```
    Would it still be worth renaming it to `_pow`?

