You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2018/02/05 10:18:50 UTC
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
GitHub user ueshin opened a pull request:
https://github.com/apache/spark/pull/20507
[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
## What changes were proposed in this pull request?
In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,
```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd
df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```
raises the following exception:
```
...
java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
...
```
Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.
This pr adds a workaround for the case.
## How was this patch tested?
Added a test and existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ueshin/apache-spark issues/SPARK-23334
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20507.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20507
----
commit 47b88734b91a7f9a4335bc3c667640eb4600b8e1
Author: Takuya UESHIN <ue...@...>
Date: 2018-02-05T09:30:20Z
Fix pandas_udf with return type StringType() to handle str type properly.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/604/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)** for PR 20507 at commit [`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r165972212
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
res = df.select(str_f(col('str')))
self.assertEquals(df.collect(), res.collect())
+ def test_vectorized_udf_string_in_udf(self):
+ from pyspark.sql.functions import pandas_udf, col
+ import pandas as pd
+ df = self.spark.range(10)
+ str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
--- End diff --
Not a big deal. How about `pd.Series(map(str, x))`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/20507
> Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.
@BryanCutler Btw, do you think this is a bug of pyarrow in Python 2?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87063/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r165968902
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
res = df.select(str_f(col('str')))
self.assertEquals(df.collect(), res.collect())
+ def test_vectorized_udf_string_in_udf(self):
+ from pyspark.sql.functions import pandas_udf, col
+ import pandas as pd
+ df = self.spark.range(10)
+ str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
+ res = df.select(str_f(col('id')))
--- End diff --
How about variable names 'expected' and 'actual'?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/20507
Merged to master and branch-2.3.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87083/testReport)** for PR 20507 at commit [`b3d5209`](https://github.com/apache/spark/commit/b3d5209b26322329d7e4ba1fd1b1457f86b44a8a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87083 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87083/testReport)** for PR 20507 at commit [`b3d5209`](https://github.com/apache/spark/commit/b3d5209b26322329d7e4ba1fd1b1457f86b44a8a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20507
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r166018470
--- Diff: python/pyspark/serializers.py ---
@@ -230,6 +230,9 @@ def create_array(s, t):
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
# TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+ elif t is not None and pa.types.is_string(t) and sys.version < '3':
+ # TODO: need decode before converting to Arrow in Python 2
+ return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
--- End diff --
@ueshin, actually, how about `s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v)` to allow non-ascii unicodes too like `u"아"`? I was worried of performance but I ran a simple perf test vs `s.str.decode('utf-8')` for sure. Seems actually fine.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87083/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r165980594
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
res = df.select(str_f(col('str')))
self.assertEquals(df.collect(), res.collect())
+ def test_vectorized_udf_string_in_udf(self):
+ from pyspark.sql.functions import pandas_udf, col
+ import pandas as pd
+ df = self.spark.range(10)
+ str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
+ res = df.select(str_f(col('id')))
--- End diff --
Sure, I'll update it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/592/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/20507
Sorry I've been travelling, but I'll try to look into this soon on the Arrow side to see if it is a bug in pyarrow. The workaround here seems fine to me.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r166168248
--- Diff: python/pyspark/serializers.py ---
@@ -230,6 +230,9 @@ def create_array(s, t):
s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
# TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+ elif t is not None and pa.types.is_string(t) and sys.version < '3':
+ # TODO: need decode before converting to Arrow in Python 2
+ return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
--- End diff --
Good catch! I'll take it. Thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with re...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r165980572
--- Diff: python/pyspark/sql/tests.py ---
@@ -3920,6 +3920,14 @@ def test_vectorized_udf_null_string(self):
res = df.select(str_f(col('str')))
self.assertEquals(df.collect(), res.collect())
+ def test_vectorized_udf_string_in_udf(self):
+ from pyspark.sql.functions import pandas_udf, col
+ import pandas as pd
+ df = self.spark.range(10)
+ str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
--- End diff --
Sounds good. I'll take it.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87069/testReport)** for PR 20507 at commit [`06ae568`](https://github.com/apache/spark/commit/06ae568df2088652754c2df66d2f78c8fbdac48d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87063 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)** for PR 20507 at commit [`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/20507
also cc @cloud-fan @gatorsmile @sameeragarwal
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20507
**[Test build #87063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87063/testReport)** for PR 20507 at commit [`47b8873`](https://github.com/apache/spark/commit/47b88734b91a7f9a4335bc3c667640eb4600b8e1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87069/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/20507
Thanks! @HyukjinKwon @BryanCutler
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/20507
I made https://issues.apache.org/jira/browse/ARROW-2101 to track the issue in Arrow
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/586/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20507
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #20507: [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return ty...
Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/20507
cc @BryanCutler @icexelloss @HyukjinKwon
Could you help me double-check this?
Since seems like this happens only in Python 2 environment, Jenkins will skip the tests.
And let me know if you know better workaround.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org