Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/10/02 08:25:19 UTC

[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/22610

    [WIP][SPARK-25461][PySpark][SQL] Print warning when return type of Pandas.Series mismatches the arrow return type of pandas udf

    ## What changes were proposed in this pull request?
    
    For Pandas UDFs, we derive the Arrow type from the UDF's declared Catalyst return data type, and we use that Arrow type to serialize the data. If the declared return type doesn't match the actual type of the `Pandas.Series` returned by the UDF, there is a risk of returning incorrect data from the Python side.
    
    This WIP work proposes to check whether the dtype of the returned `Pandas.Series` matches the declared return type of the Pandas UDF.
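    
    As a minimal sketch of the mismatch being targeted (pyarrow's `from_numpy_dtype` is the same helper the proposed check relies on; the `BooleanType` case mirrors the example in the test section below):
    
    ```python
    import pandas as pd
    import pyarrow as pa
    
    result = pd.Series([1.0, 2.0])     # what the UDF actually returns: dtype float64
    pa.from_numpy_dtype(result.dtype)  # DataType(double)
    # A UDF declared with BooleanType() maps to pa.bool_(), so the two Arrow
    # types disagree and serialization may silently produce incorrect booleans.
    ```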
    
    We could disallow mismatches by throwing an exception, to let users know they might need to set the correct return type. However, the current codebase appears to rely on such behavior. For example, there is a test `test_vectorized_udf_null_short`:
    
    ```python
    data = [(None,), (2,), (3,), (4,)]
    schema = StructType().add("short", ShortType())
    df = self.spark.createDataFrame(data, schema)
    short_f = pandas_udf(lambda x: x, ShortType())
    res = df.select(short_f(col('short')))
    self.assertEquals(df.collect(), res.collect())
    ```
    So instead, this work for now just prints a warning message when such a mismatch is detected, so users can see it when debugging why their Pandas UDFs don't produce the expected results.
    
    ## How was this patch tested?
    
    Manually test by running:
    
    ```python
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import BooleanType
    import pandas as pd
    
    values = [1.0] * 5 + [2.0] * 5
    pdf = pd.DataFrame({'A': values})
    df = spark.createDataFrame(pdf)
    @pandas_udf(returnType=BooleanType())
    def to_boolean(column):
        return column
    df.select(['A']).withColumn('to_boolean', to_boolean('A')).show()
    ```
    
    Output:
    
    ```
    WARN: Arrow type double of return Pandas.Series of the user-defined function's dtype float64 doesn't match the arrow type bool of defined return type BooleanType
    +---+----------+
    |  A|to_boolean|
    +---+----------+
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |1.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    |2.0|     false|
    +---+----------+
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-25461

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22610
    
----
commit 2fa15bda48ba64a102f114dc9119cb3c310200c4
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-09-26T09:01:40Z

    Ensure return type of Pandas.Series matches the arrow return type of pandas udf.

commit d206b7cf78f898e622f539a15e45515fcbd9e54a
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-10-02T05:29:44Z

    Print warning message instead of throwing exception.

----


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96853/testReport)** for PR 22610 at commit [`c084e74`](https://github.com/apache/spark/commit/c084e745007d455a6ea99e10cc403b55ead6278d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223177785
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType. When there is mismatch between them, it is not guaranteed
    +        that the conversion by SparkSQL during serialization is correct at all and users might get
    --- End diff --
    
    Yeah, since we don't actually cast the returned data intentionally.
    
    How about:
    ```
    When there is mismatch between them, Spark might do conversion on returned data.
    The conversion is not guaranteed to be correct and results should be checked for accuracy by users.
    ```


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96849/testReport)** for PR 22610 at commit [`d206b7c`](https://github.com/apache/spark/commit/d206b7cf78f898e622f539a15e45515fcbd9e54a).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223173637
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType. When there is mismatch between them, it is not guaranteed
    +        that the conversion by SparkSQL during serialization is correct at all and users might get
    --- End diff --
    
    >  an attempt will be made to cast the data and results should be checked for accuracy."
    
    it sounds like the casting is intentional. I think the casting logic is not that clear, as far as I can tell, compared to the SQL casting logic. Can we leave this as "not guaranteed" for now and document the casting logic here instead? Does Arrow have some kind of documentation for type conversion, BTW?


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96935/testReport)** for PR 22610 at commit [`a756c0b`](https://github.com/apache/spark/commit/a756c0b40f74f35027d65a5c143bfa4b9f5f89fb).


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3622/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96849/
    Test FAILed.


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3743/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    > It is a pretty new one; does that mean we need to upgrade to the latest PyArrow in order to use it? Since it is an option on `Table.from_pandas`, is it possible to extend it to `pyarrow.Array`?
    
    Yeah, it's part of pyarrow.Array now, but it will only be in the 0.11.0 release, so we would have to do it after the next upgrade.
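    
    For reference, a rough sketch of that option (assuming pyarrow >= 0.11, where `safe=` is exposed on `Array.from_pandas`):
    
    ```python
    import pandas as pd
    import pyarrow as pa
    
    s = pd.Series([1.5, 2.0])
    # With safe=True, an unsafe cast such as float -> int raises
    # pa.ArrowInvalid instead of silently truncating the values.
    pa.Array.from_pandas(s, type=pa.int32(), safe=True)
    ```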


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Btw, I checked that our `_minimum_pyarrow_version` is 0.8.0, so it seems that even once the next upgrade is available, this is still a potential issue for users on pyarrow versions before 0.11.0, isn't it?


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97044/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223070065
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType. When there is mismatch between them, it is not guaranteed
    +        that the conversion by SparkSQL during serialization is correct at all and users might get
    --- End diff --
    
    instead of saying "conversion is not guaranteed" which sounds like results might be arbitrary, could we say "..mismatch between them, an attempt will be made to cast the data and results should be checked for accuracy."?


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22610: [SPARK-25461][PySpark][SQL] Add document for mism...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223217249
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType (see :meth:`types.to_arrow_type` and
    +        :meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do
    +        conversion on returned data. The conversion is not guaranteed to be correct and results
    +        should be checked for accuracy by users.
    --- End diff --
    
    I am merging this since it describes the current status, but let's make it clearer and try to get rid of this note within 3.0.


---



[GitHub] spark pull request #22610: [SPARK-25461][PySpark][SQL] Add document for mism...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r223217704
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType (see :meth:`types.to_arrow_type` and
    +        :meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do
    +        conversion on returned data. The conversion is not guaranteed to be correct and results
    +        should be checked for accuracy by users.
    --- End diff --
    
    Yeah, I agree. Once the next PyArrow upgrade is available, we may be able to provide an option to raise an error on an unsafe cast.


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged to master.


---



[GitHub] spark pull request #22610: [SPARK-25461][PySpark][SQL] Add document for mism...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22610


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    > Yeah, it's part of pyarrow.Array now, but it will only be in the 0.11.0 release, so we would have to do it after the next upgrade.
    
    Then I think we can wait for the next upgrade to use this feature of pyarrow.Array and raise an exception on unsafe casts.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Thanks @BryanCutler! Yes, this should not be treated as a bug; it is meant as a warning to users that there might be a type conversion they didn't notice at first glance in their Pandas UDFs. For now the conversion is done silently behind the scenes, and as the case in the JIRA shows, it might not be easy to notice that the `Pandas.Series` returned from a UDF doesn't match the UDF's declared return type.
    
    



---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96853/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    cc @HyukjinKwon Can you take a look at this when you have time? Thanks.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    The idea sounds good to me from a cursory look for now.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    One clear next step looks to be adding some documentation ..


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222173904
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    +                      "dtype %s doesn't match the arrow type %s "
    +                      "of defined return type %s" % (arrow_type_of_result, result.dtype,
    +                                                     arrow_return_type, return_type),
    +                      file=sys.stderr)
    +        except:
    +            print("WARN: Can't infer arrow type of Pandas.Series's dtype: %s, which might not "
    +                  "match the arrow type %s of defined return type %s" % (result.dtype,
    +                                                                         arrow_return_type,
    +                                                                         return_type),
    --- End diff --
    
    ok. thanks. :-)


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Yea, I will do it this week. Sorry I missed the cc in the JIRA.


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #97044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97044/testReport)** for PR 22610 at commit [`6c6f8a1`](https://github.com/apache/spark/commit/6c6f8a1abfc8150c0acf0b0c43fd8430d9f8e5c4).


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    So pyarrow just added an option, when converting from Pandas, to raise an error for unsafe casts. I'd have to try it out to see if it would prevent this case, though. It's a common option when working with Pandas, so users might be familiar with it, and it might be more useful to expose this as a Spark conf rather than checking the types.
    
    Btw, I'm working on fixing the float to boolean conversion here https://github.com/apache/arrow/pull/2698


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Thanks for looking into this @viirya !  I agree that there are lots of cases where casting to another type is intentional and works fine, so this isn't a bug. The only other idea I have is to provide an option to raise an error if the type needs to be cast. That might be possible with pyarrow right now, but I'm not sure how useful it would be.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222885910
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2909,6 +2909,11 @@ def pandas_udf(f=None, returnType=None, functionType=None):
             can fail on special rows, the workaround is to incorporate the condition into the functions.
     
         .. note:: The user-defined functions do not take keyword arguments on the calling side.
    +
    +    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
    +        matched with defined returnType. When there is mismatch between them, it is not guaranteed
    +        that the conversion by SparkSQL during serialization is correct at all and users might get
    --- End diff --
    
    maybe I am worrying too much .. but how about we just say ..
    
    ```
    ... defined returnType (see :meth:`types.to_arrow_type` and :meth:`types.from_arrow_type`). 
    When there is mismatch between them, the conversion is not guaranteed.
    ```


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    > Thanks, @BryanCutler. WDYT about documenting the type map thing?
    
    I think that would help in the case of dates/times, because those can get a little confusing. For primitives, I think it's pretty straightforward, so I don't know how much it would help. Maybe we just highlight some potential pitfalls?
    
    The problem here was that when a null value was introduced, Pandas automatically converted the data to float to insert a NaN value, and then the Arrow conversion from float to bool is broken. When the data had only ints, the conversion seemed ok, so it ended up giving inconsistent, confusing results. Not sure what might have helped here; it's just a nasty bug :)
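    
    To make that upcast concrete (a small standalone illustration, not part of the patch):
    
    ```python
    import pandas as pd
    
    pd.Series([1, 2, 3]).dtype     # int64
    pd.Series([None, 2, 3]).dtype  # float64 -- pandas inserts NaN, forcing the upcast
    ```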



---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    @HyukjinKwon Thanks!
    
    I agree that having documentation for this is definitely useful. I will try to add it, and let's see if it looks ok to you. I think it is good to mention that users are responsible for ensuring that the return type of a Pandas UDF matches the declared return type. The mapping is a good reference to show in the document too.
    
    
    
    
    



---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222885267
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    --- End diff --
    
    Yes .. I support just fixing the doc here first and making a separate PR later if needed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    So I've added a bit of documentation for this. @HyukjinKwon @BryanCutler please check it when you have time.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    I think it is more reasonable to use the option that raises an error for unsafe casts when converting from Pandas. That should be better than displaying a warning message.
    
    Not sure how long it will be before the next upgrade; do you think we should add some words to the document to explain this pitfall in particular? Or should we just leave it until the next upgrade? @HyukjinKwon @BryanCutler
    



---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3626/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222616380
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    +                      "dtype %s doesn't match the arrow type %s "
    +                      "of defined return type %s" % (arrow_type_of_result, result.dtype,
    +                                                     arrow_return_type, return_type),
    +                      file=sys.stderr)
    +        except:
    +            print("WARN: Can't infer arrow type of Pandas.Series's dtype: %s, which might not "
    +                  "match the arrow type %s of defined return type %s" % (result.dtype,
    +                                                                         arrow_return_type,
    +                                                                         return_type),
    --- End diff --
    
    Sorry, I may have misunderstood: do you mean L113 and L114 should be aligned with L112? But after that, lint-python will complain.


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222617651
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    --- End diff --
    
    hmm, I'm neutral on whether we should display this warning message before we have an option to check for unsafe conversion via PyArrow. @HyukjinKwon, if you are also supportive, I will remove it and leave this PR as documentation only.



---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96935/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    > The only other idea I have is to provide an option to raise an error if the type needs to be cast.
    
    Actually sounds good to me.
    
    I think the problem is that we are not quite clear about when types are mismatched in UDFs (see also https://github.com/apache/spark/pull/20163 for a reminder). IIRC, we rather roughly agreed upon documenting it (and allowing exact type match (?)).
    
    @viirya and @BryanCutler, how about we document that return types should be matched (we can leave a chart or map referring to `types.to_arrow_type`)?
    
    One additional improvement might be .. we describe that the type casting behaviour is .. say .. not guaranteed, but I am not sure how we can nicely document this. Probably only mentioning the type mapping is fine ..?
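    
    As a rough illustration of that mapping (a sketch; on this branch `to_arrow_type` lives in `pyspark.sql.types`, though its location may differ across versions):
    
    ```python
    from pyspark.sql.types import DoubleType, BooleanType, to_arrow_type
    
    to_arrow_type(DoubleType())   # DataType(double)
    to_arrow_type(BooleanType())  # DataType(bool)
    ```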



---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96849 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96849/testReport)** for PR 22610 at commit [`d206b7c`](https://github.com/apache/spark/commit/d206b7cf78f898e622f539a15e45515fcbd9e54a).


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96853/testReport)** for PR 22610 at commit [`c084e74`](https://github.com/apache/spark/commit/c084e745007d455a6ea99e10cc403b55ead6278d).


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #96935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96935/testReport)** for PR 22610 at commit [`a756c0b`](https://github.com/apache/spark/commit/a756c0b40f74f35027d65a5c143bfa4b9f5f89fb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222173421
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    +                      "dtype %s doesn't match the arrow type %s "
    +                      "of defined return type %s" % (arrow_type_of_result, result.dtype,
    +                                                     arrow_return_type, return_type),
    +                      file=sys.stderr)
    +        except:
    +            print("WARN: Can't infer arrow type of Pandas.Series's dtype: %s, which might not "
    +                  "match the arrow type %s of defined return type %s" % (result.dtype,
    +                                                                         arrow_return_type,
    +                                                                         return_type),
    --- End diff --
    
    I would fix the indentation here tho :-)


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222501309
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    --- End diff --
    
    Yeah, it might be useful to see the warning if doing some local tests etc. My only concern is that users might be confused about why they see a warning locally but it doesn't appear in the logs.. Man, it would be nice to have some proper Python logging for this!
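    
    (As a hypothetical sketch of what that could look like, not what this patch does: routing the message through Python's standard `warnings` machinery instead of a bare `print` to stderr.)
    
    ```python
    import warnings
    
    # Hypothetical alternative: emit the mismatch as a Python warning so it can be
    # filtered and captured like any other warning, rather than printed to stderr.
    warnings.warn("Arrow type double of the returned Pandas.Series doesn't match "
                  "the declared return type BooleanType", UserWarning)
    ```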


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222015287
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    --- End diff --
    
    No, but like the other print usage in `worker.py`, I think this can be seen in the worker log?
    
    This is also useful when testing in the pyspark shell.


---



[GitHub] spark issue #22610: [SPARK-25461][PySpark][SQL] Add document for mismatch be...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    **[Test build #97044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97044/testReport)** for PR 22610 at commit [`6c6f8a1`](https://github.com/apache/spark/commit/6c6f8a1abfc8150c0acf0b0c43fd8430d9f8e5c4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3675/
    Test PASSed.


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Thanks, @BryanCutler. WDYT about documenting the type map thing?


---



[GitHub] spark pull request #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning wh...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22610#discussion_r222007837
  
    --- Diff: python/pyspark/worker.py ---
    @@ -84,13 +84,36 @@ def wrap_scalar_pandas_udf(f, return_type):
         arrow_return_type = to_arrow_type(return_type)
     
         def verify_result_length(*a):
    +        import pyarrow as pa
             result = f(*a)
             if not hasattr(result, "__len__"):
                 raise TypeError("Return type of the user-defined function should be "
                                 "Pandas.Series, but is {}".format(type(result)))
             if len(result) != len(a[0]):
                 raise RuntimeError("Result vector from pandas_udf was not the required length: "
                                    "expected %d, got %d" % (len(a[0]), len(result)))
    +
    +        # Ensure return type of Pandas.Series matches the arrow return type of the user-defined
    +        # function. Otherwise, we may produce incorrect serialized data.
    +        # Note: for timestamp type, we only need to ensure both types are timestamp because the
    +        # serializer will do conversion.
    +        try:
    +            arrow_type_of_result = pa.from_numpy_dtype(result.dtype)
    +            both_are_timestamp = pa.types.is_timestamp(arrow_type_of_result) and \
    +                pa.types.is_timestamp(arrow_return_type)
    +            if not both_are_timestamp and arrow_return_type != arrow_type_of_result:
    +                print("WARN: Arrow type %s of return Pandas.Series of the user-defined function's "
    --- End diff --
    
    Will this appear when run in an executor?


---



[GitHub] spark issue #22610: [WIP][SPARK-25461][PySpark][SQL] Print warning when retu...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22610
  
    Thanks @BryanCutler! Looks like a useful option. It is a pretty new one; does that mean we need to upgrade to the latest PyArrow in order to use it? Since it is an option on `Table.from_pandas`, is it possible to extend it to `pyarrow.Array`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org