Posted to reviews@spark.apache.org by jsnowacki <gi...@git.apache.org> on 2017/10/06 10:40:20 UTC

[GitHub] spark pull request #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in...

GitHub user jsnowacki opened a pull request:

    https://github.com/apache/spark/pull/19443

    [SPARK-22212][SQL][PySpark] Some SQL functions in Python fail with string column name

    ## What changes were proposed in this pull request?
    
    The issue in JIRA: [SPARK-22212](https://issues.apache.org/jira/browse/SPARK-22212)
    
    Most of the functions in `pyspark.sql.functions` accept either a column name string or a `Column` object. But some functions, like `trim`, require a `Column` object to be passed. See the code below for an illustration.
    
    ```
    >>> import pyspark.sql.functions as func
    >>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
    >>> df.select(func.trim(df["text"])).show()
    +----------+
    |trim(text)|
    +----------+
    |         a|
    |         b|
    |         c|
    |         d|
    |         e|
    +----------+
    >>> df.select(func.trim("text")).show()
    [...]
    Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
    py4j.Py4JException: Method trim([class java.lang.String]) does not exist
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
            at py4j.Gateway.invoke(Gateway.java:274)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.GatewayConnection.run(GatewayConnection.java:214)
            at java.lang.Thread.run(Thread.java:748)
    ```
    
    This is because most of the Python wrappers map a column name string to a `Column`, but functions created via `_create_function` pass the argument on as is if it is not a `Column`. Separately, the few functions that require a column name string have been moved to `functions_by_column_name` and are created by `_create_function_by_column_name`.
    
    Note that this is a Python-side fix only. Some Scala functions still do not have an overload that can be called with a string column name.
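    The coercion pattern described above can be sketched as follows. This is a simplified, self-contained illustration of the idea, not the actual PySpark source; the `Column` stand-in and `_to_column` helper here are hypothetical:

```python
# Simplified sketch of the string-to-Column coercion pattern.
# `Column` and `_to_column` are illustrative stand-ins, not PySpark internals.

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, name):
        self.name = name

def _to_column(col):
    # Accept either a Column object or a column name string.
    return col if isinstance(col, Column) else Column(col)

def _create_function(name):
    def _(col):
        # Coerce before dispatching, so both call forms work.
        col = _to_column(col)
        return "%s(%s)" % (name, col.name)
    _.__name__ = name
    return _

trim = _create_function("trim")

print(trim("text"))          # works with a column name string
print(trim(Column("text")))  # and with a Column object
```

    With this coercion in place, both calls produce the same result, which is the behavior the patch aims for.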
    
    ## How was this patch tested?
    
    Additional Python tests were written to cover this. They were run via `unittest` in an IDE and with the overall `python/run-tests` script.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsnowacki/spark-1 fix_func_str_to_col

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19443
    
----
commit c5dbd50361a37e9833708dc8985345fbf537e8d9
Author: Jakub Nowacki <j....@gmail.com>
Date:   2017-10-03T07:50:50Z

    [SPARK-22212] Fixing string to column mapping in Python functions

commit 9e52c6380ae8787d20e3442cfaf42cfb70caf4dc
Author: Jakub Nowacki <j....@gmail.com>
Date:   2017-10-06T09:07:26Z

    [SPARK-22212] Calling functions by string column name fixed and tested

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    **[Test build #82507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82507/testReport)** for PR 19443 at commit [`9e52c63`](https://github.com/apache/spark/commit/9e52c6380ae8787d20e3442cfaf42cfb70caf4dc).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    @HyukjinKwon Thanks for pointing that out. I think the argument about consistency here is valid; still, I agree with @jaceklaskowski that the change should go one way or the other, i.e. either allow string column names or remove that option completely. I don't really think it is ambiguous: functions in SQL should accept either a `Column` object or a column name as a string, with the exception of `lit`; well, and `col`, which accepts only a string. Ambiguous functions like `concat` should always check the argument type and, if it's a string, convert it to a `Column`. This enforces using a string only as a column name, and `lit` when it is an actual string literal.
    
    Also, in the case of Python the argument is a little different. We need to take into account that objects like `dict` or Pandas' `DataFrame` have made addressing columns by string name the more Pythonic way of dealing with columns. Thus, Python (and to some extent SQL and R) users expect to be able to refer to columns by their string names; using a special object for a column is a more Java (and, thus, Scala) way of looking at things. Bear in mind that many users of these interfaces are not necessarily technical, and a strict `Column`-only argument is a bit alien to them. Thus, I would argue that even if a `Column` argument were enforced in the Java and Scala APIs, other APIs should keep the by-column-name call possible, as is now done in Python, i.e. by mapping string names to `Column`.
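    The convention argued for in the comment above can be sketched as follows. This is a hypothetical illustration, not PySpark's implementation; the `Column`, `col`, `lit`, and `concat` definitions here are simplified stand-ins. A bare string always means a column name, and an explicit `lit()` call marks an actual string literal:

```python
# Hypothetical sketch: strings are column names; literals require lit().

class Column:
    """Minimal stand-in holding an expression string."""
    def __init__(self, expr):
        self.expr = expr

def col(name):
    # A string is always interpreted as a column reference.
    return Column(name)

def lit(value):
    # A literal must be wrapped explicitly.
    return Column("'%s'" % value)

def concat(*cols):
    # Coerce bare strings to column references; literals must use lit().
    exprs = [c.expr if isinstance(c, Column) else col(c).expr for c in cols]
    return Column("concat(%s)" % ", ".join(exprs))

print(concat("first", lit("-"), "last").expr)
# concat(first, '-', last)
```

    Under this convention the call is unambiguous: `"first"` and `"last"` are column names, while the `-` separator is explicitly a literal.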


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    The only other reason for Python I can think of, if the above are not compelling enough, is that when a function lacks the call-by-column-name option, we get the error only at run time, due to how names and types are mapped onto JVM code. This is partly a consequence of Python's dynamic typing. So while this is caught at compile time in Java/Scala, where it may merely be annoying and considered inconsistent, in Python a task may fail after hours because of this issue. The same is true for R (and SQL to some extent).


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    @HyukjinKwon It takes the argument of the functions created via `_create_function` and tries to cast it to a `Column` via the `_to_java_column` function. Before, the argument was passed as is, so functions like `trim`, which lack a String-argument overload on the Scala API side, failed with a string column name; now they can take both a string column name and a `Column` object.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    @jsnowacki, does the current fix make all other Python functions, except non-applicable ones, take string parameters BTW? I could check it myself, but I guess you have taken a look already.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82507/
    Test FAILed.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    Yea, I mean, I just wonder if this PR aims to fix all the same cases and instances within this Python API, `functions.py`. Are these all the cases in `functions.py` that need string support?


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    I could consider going ahead if this small fix made everything in `functions.py` consistent, but I guess it does not. I am also less sure because, IIUC, we are not even clear on what to do on the Scala side, although if I had to choose one direction, for now I think we should rather deprecate.
    
    Let's close this for now. I guess this one is not worth spending much more time on.
    



---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    I think it's okay to wait a few more days for other committers who might support or like this idea before closing this. I won't stand against it.
    
    Providing more compelling reasons would also be another good option.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    This PR fixes only the functions created using `_create_function`, which, from what I found, were the only ones affected by the issue. The rest of the functions either have different assumptions or eventually do the same thing, but more explicitly. I didn't check every function in `functions.py`, nor does the test, as the list of functions to check would have to be compiled manually. Nonetheless, from what I've seen, all the explicitly defined functions obtain a column object via `_to_java_column`, which checks the argument type and casts string column names to `Column` objects.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    OK, closing then. Should I leave the JIRA issue open or close it as well?


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    This might look okay on the Python side because the fix is minimised and does not actually increase complexity much; however, I think we focus on API consistency across languages in general. In that sense, we tend to avoid adding these string-parameter overloads on the Scala side; please see https://github.com/apache/spark/pull/18144#issuecomment-304960488, https://github.com/apache/spark/pull/18144#issuecomment-304926567 and https://github.com/apache/spark/pull/18144#issuecomment-304955155. I am -0 on this because the workaround is simple anyway.


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    > I think the argument about consistency here is valid, though, I agree with @jaceklaskowski that changes should go one way or the other, i.e. allow string column names or remove this option completely.
    
    Yea, but to be more specific, I tend to agree with @cloud-fan's suggestion of rather deprecating the current string support first. IMHO, there would have been no such arguments if we hadn't started supporting string parameters.
    
    > I don't really think that is ambiguous, as functions in SQL should either accept column object or column name as string, with the exception of lit; well, and col which accepts only string. Ambiguous functions like concat should always check for type and if it's a string, force change it to Column. This enforces usage of string only as the column name, and lit if it is an actual string literal.
    
    I think ambiguity in parameters is not the main concern; rather, it adds more complexity and grows the APIs (causing maintenance cost, etc.), particularly on the Scala side, as discussed before. In general, I usually go -0 or -1 if the workaround is easy and the existing usage does not look quite awkward. It is true that the current mix of column and string types looks a bit odd, but this could be solved by deprecating it, as I said above.
    
    > Also, in case of Python the argument is also a little bit different. We need to take into account that many objects like dict or Pandas' DataFrame made addressing columns by string name more Pythonic way of dealing with columns. Thus, Python (and to some extent SQL and R) users expect to be able to use columns by their string names as using a special object for column is a bit more Java (and, thus, Scala) way of looking at things. Bear in mind, that a lot of users of these interfaces are not necessarily technical and strict Column usage argument is a bit alien to them. Thus, I would argue that even if Column argument would be enforced in Java and Scala API, other APIs should keep the by-column-name call possibile, as it is now done in Python, i.e. by mapping the string names into Column.
    
    I think, in this case, we should start from consistent API support across languages first. I do like the Pythonic (to be more clear, Python-specific) features we currently support, e.g., the udf decorator, context manager support, etc., but this case does not sound compelling enough to fix in Python alone, and that is why it's -0 and not -1.
    



---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    **[Test build #82507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82507/testReport)** for PR 19443 at commit [`9e52c63`](https://github.com/apache/spark/commit/9e52c6380ae8787d20e3442cfaf42cfb70caf4dc).


---



[GitHub] spark pull request #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in...

Posted by jsnowacki <gi...@git.apache.org>.
Github user jsnowacki closed the pull request at:

    https://github.com/apache/spark/pull/19443


---



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    Let's resolve it as `Later` for now. Will keep my eyes on similar JIRAs and ping / cc you in the future. Thanks for bearing with me @jsnowacki.


---
