Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/01/11 15:50:00 UTC

[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20237

    [SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF

    ## What changes were proposed in this pull request?
    
    This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is not that of the whole input column but that of the batch.
    
    Group map UDFs are fine because their usage differs from that of a typical UDF, but scalar Pandas UDFs might be confused with normal UDFs.
    
    For example, consider the following:
    
    ```python
    from pyspark.sql.functions import pandas_udf, col, lit
    from pyspark.sql.types import LongType
    
    df = spark.range(1)
    f = pandas_udf(lambda x, y: len(x) + y, LongType())
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 1|
    +------------------+
    ```
    
    ```python
    from pyspark.sql.functions import udf, col, lit
    
    df = spark.range(1)
    f = udf(lambda x, y: len(x) + y, "long")
    df.select(f(lit('text'), col('id'))).show()
    ```
    
    ```
    +------------------+
    |<lambda>(text, id)|
    +------------------+
    |                 4|
    +------------------+
    ```
    
    ## How was this patch tested?
    
    Manually built the doc and checked the output.
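
The two results in the description (1 versus 4) can be reproduced without Spark. What follows is only a minimal pure-Python sketch, not Spark's implementation: plain lists stand in for `pandas.Series`, and the helper names `apply_row_udf` and `apply_batch_udf` are hypothetical.

```python
def apply_row_udf(func, *columns):
    # Regular UDF model: func receives one value per column for each row.
    return [func(*row) for row in zip(*columns)]


def apply_batch_udf(func, *columns):
    # Scalar Pandas UDF model: func receives whole batches (here, the entire
    # one-row columns) and must return one value per row in the batch.
    return list(func(*columns))


text_col = ["text"]  # models lit('text') over a single row
id_col = [0]         # spark.range(1) yields the single id 0

# Row at a time: len(x) is the length of the string 'text'.
row_result = apply_row_udf(lambda x, y: len(x) + y, text_col, id_col)

# Batch at a time: len(x) is the number of rows in the batch. The list
# comprehension mimics the element-wise addition a pandas.Series would do.
batch_result = apply_batch_udf(
    lambda x, y: [len(x) + v for v in y], text_col, id_col
)

print(row_result)    # [4]
print(batch_result)  # [1]
```

The same lambda body, `len(x) + y`, therefore measures a string in one case and a batch in the other, which is the confusion the note is meant to prevent.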

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22980

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20237
    
----
commit d2cfed308d343fb55c5fd7c0d30bcbb987948632
Author: hyukjinkwon <gu...@...>
Date:   2018-01-11T15:31:05Z

    Clarify the length of each series is of each batch within scalar Pandas UDF

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161114987
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    +           but of the batch internally used, and it is called for each batch. Therefore,
    +           this can be used, for example, to ensure the length of each returned `pandas.Series`
    +           but should not be used as the length of the whole input.
    --- End diff --
    
    How does this sound?  "..`pandas.Series`, and can not be used as the column length"


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Merged to master and branch-2.3.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by BryanCutler <gi...@git.apache.org>.

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161114750
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    +           but of the batch internally used, and it is called for each batch. Therefore,
    --- End diff --
    
    Does this sound a little better?  "..scalar UDF is not that of the whole input column, but is the length of an internal batch used for each call to the function."
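
As a rough illustration of "an internal batch used for each call to the function", the sketch below splits a five-row column into batches of two and records the length the function observes on each call. It is a simplified model under stated assumptions: lists stand in for `pandas.Series`, `call_in_batches` and `record_length` are hypothetical names, and the batch size of 2 is arbitrary. In Spark the actual batch size depends on configuration (for instance the Arrow setting `spark.sql.execution.arrow.maxRecordsPerBatch`) and on partitioning.

```python
def call_in_batches(func, column, batch_size):
    # Call func once per consecutive slice of the column, as a stand-in
    # for how a scalar Pandas UDF is invoked once per internal batch.
    results = []
    for start in range(0, len(column), batch_size):
        results.extend(func(column[start:start + batch_size]))
    return results


seen_lengths = []


def record_length(batch):
    # len(batch) is the batch length, not the length of the whole column.
    seen_lengths.append(len(batch))
    return [len(batch)] * len(batch)


column = list(range(5))
output = call_in_batches(record_length, column, batch_size=2)

print(seen_lengths)  # [2, 2, 1]
print(output)        # [2, 2, 2, 2, 1]
```

No call ever sees the value 5, so code that treats `len` of the argument as the column length would be wrong in this model.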


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    No, there are many other functions, but this specific case could cause confusion because the length is neither the length of the value nor the length of the whole input column. In other cases, calling other functions on the pandas Series usually produces the expected results.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by icexelloss <gi...@git.apache.org>.

Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    @HyukjinKwon Thanks! I think this is good.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    **[Test build #86025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86025/testReport)** for PR 20237 at commit [`0fa39d0`](https://github.com/apache/spark/commit/0fa39d06d0a49773aee147da368e49d54ebd6e7b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161371952
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    +           column, but is the length of an internal batch used for each call to the function.
    --- End diff --
    
    Nit: `but is` -> `but`


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Hey @gatorsmile, @ueshin, @BryanCutler and @icexelloss. Let's fix this by clarifying it to avoid potential confusion for now and clear up SPARK-22216's subtasks.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    **[Test build #85974 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161115654
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not of the whole input column
    +           but of the batch internally used, and it is called for each batch. Therefore,
    --- End diff --
    
    Yup, English isn't really my area :(. Will try to incorporate your suggestion.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Is `length` the only common function?


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    **[Test build #86025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86025/testReport)** for PR 20237 at commit [`0fa39d0`](https://github.com/apache/spark/commit/0fa39d06d0a49773aee147da368e49d54ebd6e7b).


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85974/


---


[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20237


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    Mind if I ask what you expect to fix, @gatorsmile? It's clear and explains the results.


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    **[Test build #85974 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85974/testReport)** for PR 20237 at commit [`d2cfed3`](https://github.com/apache/spark/commit/d2cfed308d343fb55c5fd7c0d30bcbb987948632).


---


[GitHub] spark issue #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of each se...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20237
  
    The newly added description is not clear to most Spark users. I think the description added by this PR does not explain the common error cases pointed out in the JIRA.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161371988
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    +           column, but is the length of an internal batch used for each call to the function.
    +           Therefore, this can be used, for example, to ensure the length of each returned
    --- End diff --
    
    `ensure`? Do you mean `measure`?


---


[GitHub] spark pull request #20237: [SPARK-22980][PYTHON][SQL] Clarify the length of ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20237#discussion_r161373555
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2184,6 +2184,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
            |         8|      JOHN DOE|          22|
            +----------+--------------+------------+
     
    +       .. note:: The length of `pandas.Series` within a scalar UDF is not that of the whole input
    +           column, but is the length of an internal batch used for each call to the function.
    +           Therefore, this can be used, for example, to ensure the length of each returned
    --- End diff --
    
    I meant to ensure the length of the batch because we declare "The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`."
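
One way to read "ensure the length of each returned `pandas.Series`" is as a check inside the function that the output has as many entries as the input batch. The sketch below shows that idea in plain Python, assuming lists in place of `pandas.Series`; the decorator `ensure_same_length` and the functions `double` and `total` are hypothetical and are not part of PySpark.

```python
import functools


def ensure_same_length(func):
    # Wrap a batch function so a result of the wrong length fails early.
    @functools.wraps(func)
    def wrapper(batch):
        result = func(batch)
        if len(result) != len(batch):
            raise ValueError(
                "expected %d values, got %d" % (len(batch), len(result))
            )
        return result
    return wrapper


@ensure_same_length
def double(batch):
    # Valid: returns one value per input row.
    return [2 * v for v in batch]


@ensure_same_length
def total(batch):
    # Invalid for a scalar UDF: collapses the batch to a single value.
    return [sum(batch)]


print(double([1, 2, 3]))  # [2, 4, 6]

try:
    total([1, 2, 3])
except ValueError as exc:
    print(exc)  # expected 3 values, got 1
```

Here `len(batch)` is used only to compare input and output within one call, which is the legitimate use of the batch length described in this thread.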


---
