You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by HyukjinKwon <gi...@git.apache.org> on 2018/02/07 13:36:54 UTC
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20531
[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs
## What changes were proposed in this pull request?
This PR targets to explicitly specify supported types in Pandas UDFs.
The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things.
1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see:
```python
from pyspark.sql.functions import pandas_udf
pudf = pandas_udf(lambda x: x, "binary")
df = spark.createDataFrame([[bytearray("a")]])
df.select(pudf("_1")).show()
```
```
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
```
We can document this behaviour for its guide.
2. Also, the grouped aggregate Pandas UDF fail fast on `ArrayType` but seems we can support this case.
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
df = spark.range(100).selectExpr("id", "array(id) as value")
df.groupBy("id").agg(foo("value")).show()
```
```
...
NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
```
3. Since we can check the return type ahead, we can fail fast before actual execution.
```python
# we can fail fast at this stage because we know the schema ahead
pandas_udf(lambda x: x, BinaryType())
```
## How was this patch tested?
Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark pudf-cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20531.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20531
----
commit ec708d58001be1382cabbe4357cbc68e2d51a8b6
Author: hyukjinkwon <gu...@...>
Date: 2018-02-07T00:18:20Z
Explicitly specify supported types with Pandas UDFs
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87280/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87325/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
Yup, let me try to address them tomorrow. Thanks for your review.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87164/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166645833
--- Diff: docs/sql-programming-guide.md ---
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
### Supported SQL Types
-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
--- End diff --
I thought binary type is supported... I am curious, what's the reason that it doesn't work now?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166655472
--- Diff: python/pyspark/sql/udf.py ---
@@ -112,15 +112,31 @@ def returnType(self):
else:
self._returnType_placeholder = _parse_datatype_string(self._returnType)
- if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
- and not isinstance(self._returnType_placeholder, StructType):
- raise ValueError("Invalid returnType: returnType must be a StructType for "
- "pandas_udf with function type GROUPED_MAP")
- elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
- and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
- raise NotImplementedError(
- "ArrayType, StructType and MapType are not supported with "
- "PandasUDFType.GROUPED_AGG")
+ if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
+ if isinstance(self._returnType_placeholder, StructType):
+ try:
+ to_arrow_schema(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a grouped map Pandas UDF: "
+ "%s is not supported" % str(self._returnType_placeholder))
+ else:
+ raise TypeError("Invalid returnType for a grouped map Pandas "
--- End diff --
nit: `a grouped map Pandas UDF` -> `grouped map Pandas UDFs?`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87214/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166650891
--- Diff: docs/sql-programming-guide.md ---
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
### Supported SQL Types
-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
--- End diff --
I was under impression that we don't support this. Seems Arrow doesn't work consistently with what Spark does. I think it's actually related with https://github.com/apache/spark/pull/20507.
I am careful to say this out but I believe the root cause is how to handle `str` in Python 2. Technically, it's bytes but named string. As you might already know, due to this confusion, `unicode` became `str` and `str` became `bytes` in Python 3. Spark handles this as `StringType` in general whereas seems Arrow deals with binaries.
I think we shouldn't support this for now until we get the consistent behaviour.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166653988
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
For arrays, can we add tests for:
* Test type coercion, e.g., specified type is `array<double>` and returned array is `[0, 1, 2]`?
* Test exception: function returns array of different types like `[0, "hello"]`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
Merged to master.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87160/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166654955
--- Diff: python/pyspark/sql/tests.py ---
@@ -4562,14 +4585,14 @@ def test_basic(self):
self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
def test_unsupported_types(self):
- from pyspark.sql.types import ArrayType, DoubleType, MapType
+ from pyspark.sql.types import DoubleType, MapType
from pyspark.sql.functions import pandas_udf, PandasUDFType
with QuietTest(self.sc):
with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
- @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
+ @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
--- End diff --
Seems because we don't handle the timezone issue when it's nested. There are few todos, for example:
https://github.com/apache/spark/blob/71cfba04aeec5ae9b85a507b13996e80f8750edc/python/pyspark/sql/session.py#L465
https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1726
https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1745
https://github.com/apache/spark/blob/a24c03138a6935a442b983c8a4c721b26df3f9e2/python/pyspark/sql/types.py#L1771
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166619371
--- Diff: python/pyspark/worker.py ---
@@ -116,7 +116,7 @@ def wrap_grouped_agg_pandas_udf(f, return_type):
def wrapped(*series):
import pandas as pd
result = f(*series)
- return pd.Series(result)
+ return pd.Series([result])
--- End diff --
This change seems to be required:
```python
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.array([1, 2, 3]))
0 1
1 2
2 3
dtype: int64
>>> pd.Series([np.array([1, 2, 3])])
0 [1, 2, 3]
dtype: object
>>> pd.Series(1)
0 1
dtype: int64
>>> pd.Series([1])
0 1
dtype: int64
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166831572
--- Diff: python/pyspark/sql/udf.py ---
@@ -112,15 +112,31 @@ def returnType(self):
else:
self._returnType_placeholder = _parse_datatype_string(self._returnType)
- if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
- and not isinstance(self._returnType_placeholder, StructType):
- raise ValueError("Invalid returnType: returnType must be a StructType for "
- "pandas_udf with function type GROUPED_MAP")
- elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
- and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
- raise NotImplementedError(
- "ArrayType, StructType and MapType are not supported with "
- "PandasUDFType.GROUPED_AGG")
+ if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
--- End diff --
nit: I'd prefer to keep the check order by the definition in `PythonEvalType` if you don't have a special reason.
E.g.,
```
if self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
...
elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
...
elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
...
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/713/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/804/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/670/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/669/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166654101
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
Btw, we can have this to be a follow up and I can do it too
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87328/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87329/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/767/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166650642
--- Diff: python/pyspark/sql/udf.py ---
@@ -112,15 +112,31 @@ def returnType(self):
else:
self._returnType_placeholder = _parse_datatype_string(self._returnType)
- if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
- and not isinstance(self._returnType_placeholder, StructType):
- raise ValueError("Invalid returnType: returnType must be a StructType for "
- "pandas_udf with function type GROUPED_MAP")
- elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
- and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
- raise NotImplementedError(
- "ArrayType, StructType and MapType are not supported with "
- "PandasUDFType.GROUPED_AGG")
+ if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
+ if isinstance(self._returnType_placeholder, StructType):
+ try:
+ to_arrow_schema(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a grouped map Pandas UDF: "
--- End diff --
nit: `a grouped map Pandas UDF` -> `grouped map Pandas UDFs`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87214/testReport)** for PR 20531 at commit [`36617e4`](https://github.com/apache/spark/commit/36617e4bd864e0fbca5c617d009de45a8231a5d6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87328/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166660050
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
I think with Pandas UDFs, certain type coercion is supported, e.g., when user specify "double type" and returns a `pd.Series` of `int`, it will automatically cast it to `pd.Series` of double. This behavior is different from regular Python UDF, which will return null in this case. Most of the type coercion is done by pyarrow. (Btw, I think type coercion in Pandas UDFs is an huge improvement over Python UDF because that's one of the biggest frustration our PySpark users have...)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87280/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166649880
--- Diff: python/pyspark/sql/tests.py ---
@@ -4562,14 +4585,14 @@ def test_basic(self):
self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
def test_unsupported_types(self):
- from pyspark.sql.types import ArrayType, DoubleType, MapType
+ from pyspark.sql.types import DoubleType, MapType
from pyspark.sql.functions import pandas_udf, PandasUDFType
with QuietTest(self.sc):
with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
- @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
+ @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
--- End diff --
Why is `ArrayType(TimestampType())` a special case that is not supported? (I haven't fully tested this when implementing this feature, is only array of primitives supported?)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87329/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87214/testReport)** for PR 20531 at commit [`36617e4`](https://github.com/apache/spark/commit/36617e4bd864e0fbca5c617d009de45a8231a5d6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87164/testReport)** for PR 20531 at commit [`68662ec`](https://github.com/apache/spark/commit/68662ec1a08beb0dad3c4f9a20f060ccbb236a3a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166648140
--- Diff: python/pyspark/sql/tests.py ---
@@ -4289,14 +4297,15 @@ def data(self):
.withColumn("v", explode(col('vs'))).drop('vs')
def test_simple(self):
- from pyspark.sql.functions import pandas_udf, PandasUDFType
- df = self.data
+ from pyspark.sql.functions import pandas_udf, PandasUDFType, array, col
+ df = self.data.withColumn("arr", array(col("id")))
--- End diff --
minor: It seems a bit arbitrary to mix array type in this test. Array probably belongs to `test_array`, `test_complex_type` sth like `test_all_types`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20531
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87164/testReport)** for PR 20531 at commit [`68662ec`](https://github.com/apache/spark/commit/68662ec1a08beb0dad3c4f9a20f060ccbb236a3a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87281/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87328/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/806/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87163/testReport)** for PR 20531 at commit [`7cbebaf`](https://github.com/apache/spark/commit/7cbebaf1ad20c4d461a19d4d84b2990676e566ac).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166655962
--- Diff: python/pyspark/sql/udf.py ---
@@ -112,15 +112,31 @@ def returnType(self):
else:
self._returnType_placeholder = _parse_datatype_string(self._returnType)
- if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
- and not isinstance(self._returnType_placeholder, StructType):
- raise ValueError("Invalid returnType: returnType must be a StructType for "
- "pandas_udf with function type GROUPED_MAP")
- elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
- and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
- raise NotImplementedError(
- "ArrayType, StructType and MapType are not supported with "
- "PandasUDFType.GROUPED_AGG")
+ if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
+ if isinstance(self._returnType_placeholder, StructType):
+ try:
+ to_arrow_schema(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a grouped map Pandas UDF: "
+ "%s is not supported" % str(self._returnType_placeholder))
+ else:
+ raise TypeError("Invalid returnType for a grouped map Pandas "
+ "UDF: returnType must be a StructType.")
+ elif self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
+ try:
+ to_arrow_type(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a scalar Pandas UDF: %s is "
+ "not supported" % str(self._returnType_placeholder))
+ elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF:
+ try:
+ to_arrow_type(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a grouped aggregate Pandas UDF: "
--- End diff --
ditto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/766/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/667/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87160/testReport)** for PR 20531 at commit [`ec708d5`](https://github.com/apache/spark/commit/ec708d58001be1382cabbe4357cbc68e2d51a8b6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
cc @BryanCutler, @icexelloss and @ueshin, mind taking a look please?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166655050
--- Diff: docs/sql-programming-guide.md ---
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
### Supported SQL Types
-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
--- End diff --
I see. Thanks for the explanation.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167026621
--- Diff: docs/sql-programming-guide.md ---
@@ -1676,7 +1676,7 @@ Using the above optimizations with Arrow will produce the same results as when A
enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
data types are currently supported and an error can be raised if a column has an unsupported type,
-see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
+see [Supported SQL Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
--- End diff --
Nice catch!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
@ueshin and @icexelloss, thanks for your review. I tried to address the comments at my best.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166646442
--- Diff: python/pyspark/sql/tests.py ---
@@ -3715,10 +3715,10 @@ def foo(x):
self.assertEqual(foo.returnType, schema)
self.assertEqual(foo.evalType, PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF)
- @pandas_udf(returnType='v double', functionType=PandasUDFType.SCALAR)
+ @pandas_udf(returnType='double', functionType=PandasUDFType.SCALAR)
def foo(x):
return x
- self.assertEqual(foo.returnType, schema)
+ self.assertEqual(foo.returnType, schema[0].dataType)
--- End diff --
Should we just:
```
self.assertEqual(foo.returnType, DoubleType())
```
?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87325/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/20531
@HyukjinKwon Yes, LGTM.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167732281
--- Diff: python/pyspark/sql/types.py ---
@@ -1638,6 +1638,8 @@ def to_arrow_type(dt):
# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
arrow_type = pa.timestamp('us', tz='UTC')
elif type(dt) == ArrayType:
+ if type(dt.elementType) == TimestampType:
+ raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
--- End diff --
I think timestamps with localisation issue. See https://github.com/apache/spark/pull/20531#discussion_r166649880.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
@ueshin, does this looks fine to you too?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87325/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87280/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166832153
--- Diff: python/pyspark/sql/tests.py ---
@@ -4562,14 +4585,14 @@ def test_basic(self):
self.assertPandasEqual(expected4.toPandas(), result4.toPandas())
def test_unsupported_types(self):
- from pyspark.sql.types import ArrayType, DoubleType, MapType
+ from pyspark.sql.types import DoubleType, MapType
from pyspark.sql.functions import pandas_udf, PandasUDFType
with QuietTest(self.sc):
with self.assertRaisesRegexp(NotImplementedError, 'not supported'):
- @pandas_udf(ArrayType(DoubleType()), PandasUDFType.GROUPED_AGG)
+ @pandas_udf(ArrayType(ArrayType(TimestampType())), PandasUDFType.GROUPED_AGG)
def mean_and_std_udf(v):
--- End diff --
nit: should rename this?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166656777
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
If you meant to type coercion (did I understand correctly?), I already tested. Seems not working properly. Similar thing was discussed in https://github.com/apache/spark/pull/20163#issuecomment-356231655 (thanks @ueshin).
Will reread the comments when I am more awake tomorrow ...
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167024836
--- Diff: docs/sql-programming-guide.md ---
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
### Supported SQL Types
-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
--- End diff --
I agree, we need to look into these details more before we can support this type
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87163/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87281/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167391819
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
Hm .. let's do it separately for type coercion stuff in another issue. I think we need another iteration for it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166661658
--- Diff: python/pyspark/sql/tests.py ---
@@ -4509,23 +4523,32 @@ def weighted_mean(v, w):
return weighted_mean
def test_manual(self):
+ from pyspark.sql.functions import pandas_udf, array
+
df = self.data
sum_udf = self.pandas_agg_sum_udf
mean_udf = self.pandas_agg_mean_udf
-
- result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+ mean_arr_udf = pandas_udf(
+ self.pandas_agg_mean_udf.func,
--- End diff --
Btw, if type coercion is not working with array type, I think it's still fine to allow using array type and fix type coercion separately.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
Thank you for reviewing this, @icexelloss, @ueshin and @BryanCutler.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r166655615
--- Diff: python/pyspark/sql/udf.py ---
@@ -112,15 +112,31 @@ def returnType(self):
else:
self._returnType_placeholder = _parse_datatype_string(self._returnType)
- if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF \
- and not isinstance(self._returnType_placeholder, StructType):
- raise ValueError("Invalid returnType: returnType must be a StructType for "
- "pandas_udf with function type GROUPED_MAP")
- elif self.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF \
- and isinstance(self._returnType_placeholder, (StructType, ArrayType, MapType)):
- raise NotImplementedError(
- "ArrayType, StructType and MapType are not supported with "
- "PandasUDFType.GROUPED_AGG")
+ if self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
+ if isinstance(self._returnType_placeholder, StructType):
+ try:
+ to_arrow_schema(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a grouped map Pandas UDF: "
+ "%s is not supported" % str(self._returnType_placeholder))
+ else:
+ raise TypeError("Invalid returnType for a grouped map Pandas "
+ "UDF: returnType must be a StructType.")
+ elif self.evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF:
+ try:
+ to_arrow_type(self._returnType_placeholder)
+ except TypeError:
+ raise NotImplementedError(
+ "Invalid returnType with a scalar Pandas UDF: %s is "
--- End diff --
ditto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87160/testReport)** for PR 20531 at commit [`ec708d5`](https://github.com/apache/spark/commit/ec708d58001be1382cabbe4357cbc68e2d51a8b6).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87163/testReport)** for PR 20531 at commit [`7cbebaf`](https://github.com/apache/spark/commit/7cbebaf1ad20c4d461a19d4d84b2990676e566ac).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87281/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20531
**[Test build #87329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87329/testReport)** for PR 20531 at commit [`07f2d78`](https://github.com/apache/spark/commit/07f2d781022a44186fe5ab4c5621d80acce0711f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20531
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/20531
@HyukjinKwon Looks good to me at high level. Left some comments.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by icexelloss <gi...@git.apache.org>.
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/20531
@HyukjinKwon LGTM! My only comment left is https://github.com/apache/spark/pull/20531#discussion_r166653988 . But we can have another PR for testing type coercion with array type.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20531: [SPARK-23352][PYTHON] Explicitly specify supported types...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20531
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167029101
--- Diff: docs/sql-programming-guide.md ---
@@ -1734,7 +1734,7 @@ For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/p
### Supported SQL Types
-Currently, all Spark SQL data types are supported by Arrow-based conversion except `MapType`,
+Currently, all Spark SQL data types are supported by Arrow-based conversion except `BinaryType`, `MapType`,
--- End diff --
should `BinaryType` be added to the unsupported types with arrow.enabled in SQLConf.scala?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20531: [SPARK-23352][PYTHON] Explicitly specify supporte...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20531#discussion_r167718299
--- Diff: python/pyspark/sql/types.py ---
@@ -1638,6 +1638,8 @@ def to_arrow_type(dt):
# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read
arrow_type = pa.timestamp('us', tz='UTC')
elif type(dt) == ArrayType:
+ if type(dt.elementType) == TimestampType:
+ raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
--- End diff --
What is the behavior before this PR?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org