You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/02/07 13:36:54 UTC

[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20531

    [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs

    ## What changes were proposed in this pull request?
    
    This PR targets to explicitly specify supported types in Pandas UDFs.
    The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.
    
    1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:
    
        ```python
        from pyspark.sql.functions import pandas_udf
        pudf = pandas_udf(lambda x: x, "binary")
        df = spark.createDataFrame([[bytearray("a")]])
        df.select(pudf("_1")).show()
        ```
        ```
        ...
        TypeError: Unsupported type in conversion to Arrow: BinaryType
        ```
    
        We can document this behaviour for its guide.
    
    2. Also, the grouped aggregate Pandas UDF fail fast on `ArrayType` but seems we can support this case.
    
        ```python
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
        df = spark.range(100).selectExpr("id", "array(id) as value")
        df.groupBy("id").agg(foo("value")).show()
        ```
    
        ```
        ...
         NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
        ```
    
    3. Since we can check the return type ahead, we can fail fast before actual execution.
    
        ```python
        # we can fail fast at this stage because we know the schema ahead
        pandas_udf(lambda x: x, BinaryType())
        ```
    
    ## How was this patch tested?
    
    Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pudf-cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20531
    
----
commit ec708d58001be1382cabbe4357cbc68e2d51a8b6
Author: hyukjinkwon <gu...@...>
Date:   2018-02-07T00:18:20Z

    Explicitly specify supported types with Pandas UDFs

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87280/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87325/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Yup, let me try to address them tomorrow. Thanks for your review.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87164/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166645833
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
     
     ### Supported SQL Types
     
    -Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
    +Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
    --- End diff --
    
    I thought binary type is supported... I am curious, what's the reason that it doesn't work now?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166655472
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -112,15 +112,31 @@ def returnType(self):
                 else:
                     self._returnType_placeholder = _parse_datatype_string(self._returnType)
     
    -        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
    -                and not isinstance(self._returnType_placeholder, StructType):
    -            raise ValueError("Invalid returnType: returnType must be a StructType for "
    -                             "pandas_udf with function type GROUPED_MAP")
    -        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
    -                and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
    -            raise NotImplementedError(
    -                "ArrayType, StructType and MapType are not supported with "
    -                "PandasUDFType.GROUPED_AGG")
    +        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    +            if isinstance(self._returnType_placeholder, StructType):
    +                try:
    +                    to_arrow_schema(self._returnType_placeholder)
    +                except TypeError:
    +                    raise NotImplementedError(
    +                        "Invalid returnType with a grouped map Pandas UDF: "
    +                        "%s is not supported" % str(self._returnType_placeholder))
    +            else:
    +                raise TypeError("Invalid returnType for a grouped map Pandas "
    --- End diff --
    
    nit: `a grouped map Pandas UDF` -> `grouped map Pandas UDFs?`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87214/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166650891
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
     
     ### Supported SQL Types
     
    -Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
    +Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
    --- End diff --
    
    I was under impression that we don't support this. Seems Arrow doesn't work consistently with what Spark does. I think it's actually related with https://github.com/apache/spark/pull/20507.
    
    I am careful to say this out but I believe the root cause is how to handle `str` in Python 2. Technically, it's bytes but named string. As you might already know, due to this confusion, `unicode` became `str` and `str` became `bytes` in Python 3. Spark handles this as `StringType` in general whereas seems Arrow deals with binaries. 
    
    I think we shouldn't support this for now until we get the consistent behaviour.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166653988
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    For arrays, can we add tests for:
    * Test type coercion, e.g., specified type is `array<double>` and returned array is `[0, 1, 2]`?
    * Test exception: function returns array of different types like `[0, "hello"]`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87160/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166654955
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4562,14 +4585,14 @@ def test_basic(self):
             self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
     
         def test_unsupported_types(self):
    -        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.types import DoubleType, MapType
             from pyspark.sql.functions import pandas_udf, PandasUDFType
     
             with QuietTest(self.sc):
                 with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    -                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
    +                @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
    --- End diff --
    
    Seems because we don't handle the timezone issue when it's nested. There are few todos, for example:
    
    https://github.com/apache/spark/blob/71cfba04aeec5ae9b85a507b13996e80f8750edc/python/pyspark/sql/session.py#L465
    
    https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1726
    
    https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1745
    
    https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1771



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166619371
  
    --- Diff: python/pyspark/worker.py ---
    @@ -116,7 +116,7 @@ def wrap_grouped_agg_pandas_udf(f, return_type):
         def wrapped(*series):
             import pandas as pd
             result = f(*series)
    -        return pd.Series(result)
    +        return pd.Series([result])
    --- End diff --
    
    This change seems to be required:
    
    ```python
    >>> import numpy as np
    >>> import pandas as pd
    >>> pd.Series(np.array([1, 2, 3]))
    0    1
    1    2
    2    3
    dtype: int64
    >>> pd.Series([np.array([1, 2, 3])])
    0    [1, 2, 3]
    dtype: object
    >>> pd.Series(1)
    0    1
    dtype: int64
    >>> pd.Series([1])
    0    1
    dtype: int64
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166831572
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -112,15 +112,31 @@ def returnType(self):
                 else:
                     self._returnType_placeholder = _parse_datatype_string(self._returnType)
     
    -        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
    -                and not isinstance(self._returnType_placeholder, StructType):
    -            raise ValueError("Invalid returnType: returnType must be a StructType for "
    -                             "pandas_udf with function type GROUPED_MAP")
    -        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
    -                and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
    -            raise NotImplementedError(
    -                "ArrayType, StructType and MapType are not supported with "
    -                "PandasUDFType.GROUPED_AGG")
    +        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    --- End diff --
    
    nit: I'd prefer to keep the check order by the definition in `PythonEvalType` if you don't have a special reason.
    
    E.g.,
    
    ```
    if self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
        ...
    elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
        ...
    elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
        ...
    ```



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/713/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/804/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/670/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/669/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166654101
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    Btw, we can have this to be a follow up and I can do it too


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87328/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87329/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/767/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166650642
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -112,15 +112,31 @@ def returnType(self):
                 else:
                     self._returnType_placeholder = _parse_datatype_string(self._returnType)
     
    -        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
    -                and not isinstance(self._returnType_placeholder, StructType):
    -            raise ValueError("Invalid returnType: returnType must be a StructType for "
    -                             "pandas_udf with function type GROUPED_MAP")
    -        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
    -                and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
    -            raise NotImplementedError(
    -                "ArrayType, StructType and MapType are not supported with "
    -                "PandasUDFType.GROUPED_AGG")
    +        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    +            if isinstance(self._returnType_placeholder, StructType):
    +                try:
    +                    to_arrow_schema(self._returnType_placeholder)
    +                except TypeError:
    +                    raise NotImplementedError(
    +                        "Invalid returnType with a grouped map Pandas UDF: "
    --- End diff --
    
    nit: `a grouped map Pandas UDF` -> `grouped map Pandas UDFs`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87214/testReport)** for PR 20531 at commit [`36617e4`](https://github.com/apache/spark/commit/36617e4bd864e0fbca5c617d009de45a8231a5d6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87328/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166660050
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    I think with Pandas UDFs, certain type coercion is supported, e.g., when user specify "double type" and returns a `pd.Series` of `int`, it will automatically cast it to `pd.Series` of double. This behavior is different from regular Python UDF, which will return null in this case. Most of the type coercion is done by pyarrow. (Btw, I think type coercion in Pandas UDFs is an huge improvement over Python UDF because that's one of the biggest frustration our PySpark users have...)
    
     
    
    
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87280/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166649880
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4562,14 +4585,14 @@ def test_basic(self):
             self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
     
         def test_unsupported_types(self):
    -        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.types import DoubleType, MapType
             from pyspark.sql.functions import pandas_udf, PandasUDFType
     
             with QuietTest(self.sc):
                 with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    -                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
    +                @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
    --- End diff --
    
    Why is `ArrayType(TimestampType())` a special case that is not supported? (I haven't fully tested this when implementing this feature, is only array of primitives supported?)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87329/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87214/testReport)** for PR 20531 at commit [`36617e4`](https://github.com/apache/spark/commit/36617e4bd864e0fbca5c617d009de45a8231a5d6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87164/testReport)** for PR 20531 at commit [`68662ec`](https://github.com/apache/spark/commit/68662ec1a08beb0dad3c4f9a20f060ccbb236a3a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166648140
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4289,14 +4297,15 @@ def data(self):
                 .withColumn("v", explode(col('vs'))).drop('vs')
     
         def test_simple(self):
    -        from pyspark.sql.functions import pandas_udf, PandasUDFType
    -        df = self.data
    +        from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col
    +        df = self.data.withColumn("arr", array(col("id")))
    --- End diff --
    
    minor: It seems a bit arbitrary to mix array type in this test. Array probably belongs to `test_array`, `test_complex_type` sth like `test_all_types`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20531


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87164/testReport)** for PR 20531 at commit [`68662ec`](https://github.com/apache/spark/commit/68662ec1a08beb0dad3c4f9a20f060ccbb236a3a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87281/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87328/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/806/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87163/testReport)** for PR 20531 at commit [`7cbebaf`](https://github.com/apache/spark/commit/7cbebaf1ad20c4d461a19d4d84b2990676e566ac).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166655962
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -112,15 +112,31 @@ def returnType(self):
                 else:
                     self._returnType_placeholder = _parse_datatype_string(self._returnType)
     
    -        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
    -                and not isinstance(self._returnType_placeholder, StructType):
    -            raise ValueError("Invalid returnType: returnType must be a StructType for "
    -                             "pandas_udf with function type GROUPED_MAP")
    -        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
    -                and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
    -            raise NotImplementedError(
    -                "ArrayType, StructType and MapType are not supported with "
    -                "PandasUDFType.GROUPED_AGG")
    +        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    +            if isinstance(self._returnType_placeholder, StructType):
    +                try:
    +                    to_arrow_schema(self._returnType_placeholder)
    +                except TypeError:
    +                    raise NotImplementedError(
    +                        "Invalid returnType with a grouped map Pandas UDF: "
    +                        "%s is not supported" % str(self._returnType_placeholder))
    +            else:
    +                raise TypeError("Invalid returnType for a grouped map Pandas "
    +                                "UDF: returnType must be a StructType.")
    +        elif self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
    +            try:
    +                to_arrow_type(self._returnType_placeholder)
    +            except TypeError:
    +                raise NotImplementedError(
    +                    "Invalid returnType with a scalar Pandas UDF: %s is "
    +                    "not supported" % str(self._returnType_placeholder))
    +        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
    +            try:
    +                to_arrow_type(self._returnType_placeholder)
    +            except TypeError:
    +                raise NotImplementedError(
    +                    "Invalid returnType with a grouped aggregate Pandas UDF: "
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/766/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/667/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87160/testReport)** for PR 20531 at commit [`ec708d5`](https://github.com/apache/spark/commit/ec708d58001be1382cabbe4357cbc68e2d51a8b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    cc @BryanCutler, @icexelloss and @ueshin, mind taking a look please?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166655050
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
     
     ### Supported SQL Types
     
    -Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
    +Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
    --- End diff --
    
    I see. Thanks for the explanation.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167026621
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1676,7 +1676,7 @@ Using the above optimizations with Arrow will produce the same results as when A
     enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
     DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
     data types are currently supported and an error can be raised if a column has an unsupported type,
    -see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
    +see [Supported SQL Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
    --- End diff --
    
    Nice catch!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    @ueshin and @icexelloss, thanks for your review. I tried to address the comments at my best.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166646442
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -3715,10 +3715,10 @@ def foo(x):
             self.assertEqual(foo.returnType, schema)
             self.assertEqual(foo.evalType, PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF)
     
    -        @pandas_udf(returnType='v double', functionType=PandasUDFType.SCALAR)
    +        @pandas_udf(returnType='double', functionType=PandasUDFType.SCALAR)
             def foo(x):
                 return x
    -        self.assertEqual(foo.returnType, schema)
    +        self.assertEqual(foo.returnType, schema[0].dataType)
    --- End diff --
    
    Should we just:
    ```
    self.assertEqual(foo.returnType, DoubleType())
    ```
    ?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87325/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    @HyukjinKwon Yes, LGTM.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167732281
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1638,6 +1638,8 @@ def to_arrow_type(dt):
             # Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
             arrow_type = pa.timestamp('us', tz='UTC')
         elif type(dt) == ArrayType:
    +        if type(dt.elementType) == TimestampType:
    +            raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
    --- End diff --
    
    I think timestamps with localisation issue. See https://github.com/apache/spark/pull/20531#discussion_r166649880.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    @ueshin, does this looks fine to you too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87325/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87280/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166832153
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4562,14 +4585,14 @@ def test_basic(self):
             self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
     
         def test_unsupported_types(self):
    -        from pyspark.sql.types import ArrayType, DoubleType, MapType
    +        from pyspark.sql.types import DoubleType, MapType
             from pyspark.sql.functions import pandas_udf, PandasUDFType
     
             with QuietTest(self.sc):
                 with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
    -                @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
    +                @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
                     def mean_and_std_udf(v):
    --- End diff --
    
    nit: should rename this?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166656777
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    If you meant to type coercion (did I understand correctly?), I already tested. Seems not working properly. Similar thing was discussed in https://github.com/apache/spark/pull/20163#issuecomment-356231655 (thanks @ueshin).
    
    Will reread the comments when I am more awake tomorrow ...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167024836
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
     
     ### Supported SQL Types
     
    -Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
    +Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
    --- End diff --
    
    I agree, we need to look into these details more before we can support this type


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87163/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87281/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167391819
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    Hm .. let's do it separately for type coercion stuff in another issue. I think we need another iteration for it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166661658
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
             return weighted_mean
     
         def test_manual(self):
    +        from pyspark.sql.functions import pandas_udf, array
    +
             df = self.data
             sum_udf = self.pandas_agg_sum_udf
             mean_udf = self.pandas_agg_mean_udf
    -
    -        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
    +        mean_arr_udf = pandas_udf(
    +            self.pandas_agg_mean_udf.func,
    --- End diff --
    
    Btw, if type coercion is not working with array type, I think it's still fine to allow using array type and fix type coercion separately.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Thank you for reviewing this, @icexelloss, @ueshin and @BryanCutler.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r166655615
  
    --- Diff: python/pyspark/sql/udf.py ---
    @@ -112,15 +112,31 @@ def returnType(self):
                 else:
                     self._returnType_placeholder = _parse_datatype_string(self._returnType)
     
    -        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
    -                and not isinstance(self._returnType_placeholder, StructType):
    -            raise ValueError("Invalid returnType: returnType must be a StructType for "
    -                             "pandas_udf with function type GROUPED_MAP")
    -        elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
    -                and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
    -            raise NotImplementedError(
    -                "ArrayType, StructType and MapType are not supported with "
    -                "PandasUDFType.GROUPED_AGG")
    +        if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
    +            if isinstance(self._returnType_placeholder, StructType):
    +                try:
    +                    to_arrow_schema(self._returnType_placeholder)
    +                except TypeError:
    +                    raise NotImplementedError(
    +                        "Invalid returnType with a grouped map Pandas UDF: "
    +                        "%s is not supported" % str(self._returnType_placeholder))
    +            else:
    +                raise TypeError("Invalid returnType for a grouped map Pandas "
    +                                "UDF: returnType must be a StructType.")
    +        elif self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
    +            try:
    +                to_arrow_type(self._returnType_placeholder)
    +            except TypeError:
    +                raise NotImplementedError(
    +                    "Invalid returnType with a scalar Pandas UDF: %s is "
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87160/testReport)** for PR 20531 at commit [`ec708d5`](https://github.com/apache/spark/commit/ec708d58001be1382cabbe4357cbc68e2d51a8b6).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87163/testReport)** for PR 20531 at commit [`7cbebaf`](https://github.com/apache/spark/commit/7cbebaf1ad20c4d461a19d4d84b2990676e566ac).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87281/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    **[Test build #87329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87329/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    @HyukjinKwon Looks good to me at high level. Left some comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    @HyukjinKwon LGTM! My only comment left is https://github.com/apache/spark/pull/20531#discussion_r166653988 . But we can have another PR for testing type coercion with array type.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20531
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167029101
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
     
     ### Supported SQL Types
     
    -Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
    +Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
    --- End diff --
    
    should `BinaryType` be added to the unsupported types with arrow.enabled in SQLConf.scala?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20531#discussion_r167718299
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -1638,6 +1638,8 @@ def to_arrow_type(dt):
             # Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
             arrow_type = pa.timestamp('us', tz='UTC')
         elif type(dt) == ArrayType:
    +        if type(dt.elementType) == TimestampType:
    +            raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
    --- End diff --
    
    What is the behavior before this PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org