Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/10/27 05:12:35 UTC

[GitHub] spark pull request #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditio...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/19584

    [SPARK-22347][SQL] Disable PythonUDFs in conditional expressions

    ## What changes were proposed in this pull request?
    
    Under the current execution mode, Python UDFs are not supported as branch values or as the else value of a CaseWhen expression. Batch/vectorized execution evaluates all Python UDFs in an operator at once, for every input row, before the conditional expression is applied. This breaks the semantics of conditional expressions and can cause failures. In the example below, the UDF is still invoked for x = 0 despite the x > 0 guard:
    
    ```python
    from pyspark.sql import functions as F, Row, types
    
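    def divideByValue():
        # Raises ZeroDivisionError whenever value == 0.
        def fn(value): return 10 // int(value)
        return F.udf(fn, types.IntegerType())
    
    # `sc` is the SparkContext, e.g. as provided by the pyspark shell.
    df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()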
    
    x = F.col('x')
    df2 = df.select(F.when((x > 0), divideByValue()(x)))
    df2.show()
    ```
    
    Supporting conditional execution for Python UDFs might not be easy. It also doesn't make much sense in the context of vectorized Python UDFs.
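
The failure mode described above can be sketched without Spark. This is a simplified plain-Python model, not Spark's actual implementation: `divide_by_value`, `eval_batched`, and `eval_conditional` are hypothetical names chosen for illustration. The batched variant computes the function for every row before the condition is applied, while the conditional variant calls it only for rows that pass the condition.

```python
def divide_by_value(value):
    """Plain-Python stand-in for the UDF body (hypothetical name)."""
    return 10 // int(value)


def eval_batched(rows):
    # Step 1: the function is evaluated for every row up front ...
    udf_results = [divide_by_value(x) for x in rows]
    # Step 2: ... and only afterwards does the condition select results.
    return [r if x > 0 else None for x, r in zip(rows, udf_results)]


def eval_conditional(rows):
    # The function is called only for rows that satisfy the condition.
    return [divide_by_value(x) if x > 0 else None for x in rows]


rows = [5, 0]

try:
    batched = eval_batched(rows)
except ZeroDivisionError as exc:
    # The row with x = 0 is evaluated even though the condition excludes it.
    batched = "failed: {}".format(exc)

conditional = eval_conditional(rows)

print(batched)
print(conditional)
```

In this model the batched path fails on the row with x = 0, whereas the conditional path never sees that row.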
    
    To reduce confusion for end users, this patch disables the use of Python UDFs in conditional expressions such as CaseWhen, and suggests implementing the condition logic inside the Python UDF instead. Incorporating the condition logic of CaseWhen into the Python UDF is usually easy, e.g. for the above example:
    
    ```python
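    def divideByValue():
        # Guard inside the UDF: return None (null) instead of dividing by zero.
        # Floor division keeps the result an int, matching IntegerType, on Python 3 too.
        def fn(value): return 10 // int(value) if (value > 0) else None
        return F.udf(fn, types.IntegerType())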
    
    df2 = df.select(divideByValue()(x))
    df2.show()
    ```
    ```
    +-----+
    |fn(x)|
    +-----+
    |    2|
    | null|
    +-----+
    ```
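
The same idea can be factored into a small reusable wrapper. The following is only a sketch under stated assumptions: `guarded` and `safe_divide` are hypothetical helpers and are not part of this patch or of PySpark. The wrapper calls the wrapped function only when the predicate holds, and treats `None` (a null input) as not matching.

```python
def guarded(fn, predicate, default=None):
    """Return a function that calls fn only when predicate(value) holds.

    Hypothetical helper for illustration; not part of PySpark.
    """
    def wrapper(value):
        # None models a null input: never pass it to the predicate or to fn.
        if value is None or not predicate(value):
            return default
        return fn(value)
    return wrapper


# Same logic as the guarded UDF body above, built from two small pieces.
safe_divide = guarded(lambda v: 10 // int(v), lambda v: v > 0)

results = [safe_divide(v) for v in (5, 0, None)]
print(results)  # [2, None, None]
```

Since the wrapper is an ordinary Python callable, it should be possible to register it in the same way as the function above, e.g. `F.udf(safe_divide, types.IntegerType())`.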
    
    ## How was this patch tested?
    
    Added Python tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22347-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19584
    
----
commit cb820d59153562e316c535543242a6bdc599bdee
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-10-26T11:56:02Z

    Disable PythonUDFs in conditional expressions.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    cc @HyukjinKwon @ueshin @BryanCutler @cloud-fan Using Python UDFs in conditional expressions breaks the original semantics and can cause failures. For now, this patch simply disables this usage.
    
    What do you think about it? 


---



[GitHub] spark pull request #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditio...

Posted by viirya <gi...@git.apache.org>.
Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/19584


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    **[Test build #83108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83108/testReport)** for PR 19584 at commit [`cb820d5`](https://github.com/apache/spark/commit/cb820d59153562e316c535543242a6bdc599bdee).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83113/
    Test PASSed.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    **[Test build #83108 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83108/testReport)** for PR 19584 at commit [`cb820d5`](https://github.com/apache/spark/commit/cb820d59153562e316c535543242a6bdc599bdee).


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83108/
    Test FAILed.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    **[Test build #83113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83113/testReport)** for PR 19584 at commit [`cb820d5`](https://github.com/apache/spark/commit/cb820d59153562e316c535543242a6bdc599bdee).


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    From what I tried, pandas UDFs are more resistant to this kind of error thanks to the capabilities of pandas Series. So leaving them usable in conditional expressions does not seem to be a big deal, if we don't consider performance.
    
    The only problem is non-vectorized Python UDFs. Disabling them directly would break compatibility. Maybe we still need to support them...
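
The resistance of pandas Series mentioned above can be seen in plain pandas, outside of Spark. This is a minimal sketch, assuming pandas is installed; `divide_by_value` is a hypothetical stand-in for the body of a vectorized UDF. Dividing by a Series that contains zero yields `inf` for that element instead of raising, so a later condition can still discard the value.

```python
import pandas as pd


def divide_by_value(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the body of a vectorized (pandas) UDF."""
    # Element-wise true division: a zero divisor produces inf, not an exception.
    return 10 / values


x = pd.Series([5, 0])
result = divide_by_value(x)

# Apply the condition afterwards, as a conditional expression would:
# elements where x > 0 is False are replaced by NaN.
masked = result.where(x > 0)

print(result.tolist())
print(masked.tolist())
```

The whole column is still computed, which is the performance cost referred to above, but the out-of-range row no longer aborts the evaluation.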


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    **[Test build #83113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83113/testReport)** for PR 19584 at commit [`cb820d5`](https://github.com/apache/spark/commit/cb820d59153562e316c535543242a6bdc599bdee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19584: [SPARK-22347][SQL] Disable PythonUDFs in conditional exp...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19584
  
    retest this please.


---
