You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by xuanyuanking <gi...@git.apache.org> on 2018/08/18 08:42:26 UTC
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
GitHub user xuanyuanking opened a pull request:
https://github.com/apache/spark/pull/22140
[SPARK-25072][PySpark] Forbid extra value for custom Row
## What changes were proposed in this pull request?
Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.
## How was this patch tested?
New UT in pyspark-sql
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuanyuanking/spark SPARK-25072
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22140.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22140
----
commit b8c6522bccde51584e9878144924fd7b92f8785f
Author: liyuanjian <li...@...>
Date: 2018-08-18T08:36:53Z
Forbidden extra value for custom Row
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/22140
```
@xuanyuanking Could you please update the document?
```
#22369 Thanks for reminding, I'll pay attention in future work.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22140
**[Test build #95756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95756/testReport)** for PR 22140 at commit [`eb3f506`](https://github.com/apache/spark/commit/eb3f506817e6cb99230853ffd5c50e3299527d4b).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22140
**[Test build #94920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94920/testReport)** for PR 22140 at commit [`b8c6522`](https://github.com/apache/spark/commit/b8c6522bccde51584e9878144924fd7b92f8785f).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22140
@BryanCutler @HyukjinKwon Thanks for your understanding. Normally, we are very conservative to introduce any potential behavior change to the released version.
I just reverted it from branch 2.3. Thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215601543
--- Diff: python/pyspark/sql/tests.py ---
@@ -269,6 +269,10 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)
+ def test_invalid_create_row(slef):
+ rowClass = Row("c1", "c2")
--- End diff --
Thanks, done in eb3f506.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22140
**[Test build #94920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94920/testReport)** for PR 22140 at commit [`b8c6522`](https://github.com/apache/spark/commit/b8c6522bccde51584e9878144924fd7b92f8785f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22140
cc @BryanCutler as well since we discussed an issue about this code path before.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/22140
AFAIC, the fix should forbid illegal extra value passing. If less values than fields it should get a `AttributeError` while accessing as the currently implement, not ban it here? What do you think :) @HyukjinKwon @BryanCutler Thanks.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215483530
--- Diff: python/pyspark/sql/tests.py ---
@@ -269,6 +269,10 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)
+ def test_invalid_create_row(slef):
+ rowClass = Row("c1", "c2")
--- End diff --
nit: `rowClass` -> `row_class`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215601350
--- Diff: python/pyspark/sql/tests.py ---
@@ -269,6 +269,10 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)
+ def test_invalid_create_row(slef):
--- End diff --
Thanks, done in eb3f506.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215356501
--- Diff: python/pyspark/sql/types.py ---
@@ -1397,6 +1397,8 @@ def _create_row_inbound_converter(dataType):
def _create_row(fields, values):
+ if len(values) > len(fields):
+ raise ValueError("Can not create %s by %s" % (fields, values))
--- End diff --
I'd like to improve this message a little, maybe "Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
Does it make any sense to have less values than fields? Maybe we should check that they are equal, wdyt @HyukjinKwon ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95756/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
> Thanks for your understanding. Normally, we are very conservative to introduce any potential behavior change to the released version.
Yes, I know. It seemed to me at the time as failing fast rather than later and improving the error message, but best to be safe. Thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22140
@BryanCutler What is the reason to backport this PR? This sounds a behavior change.
@xuanyuanking Could you please update the document?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22140
Yea, actually I wouldn't at least backport this to branch-2.3 since the release is very close. Looks a bug to me as well.
One nitpicking is the case with RDD operation:
```python
>>> from pyspark.sql import Row
>>> row_class = Row("c1", "c2")
>>> row = row_class(1, 2, 3)
>>> spark.sparkContext.parallelize([row]).map(lambda r: r.c1).collect()
[1]
```
This is really unlikely and I even wonder if it makes any sense, but still there might be a case although the creation of the namedtuple like row itself should be disallowed, as fixed here.
Can we just simply take this out from branch-2.3?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215601486
--- Diff: python/pyspark/sql/types.py ---
@@ -1397,6 +1397,8 @@ def _create_row_inbound_converter(dataType):
def _create_row(fields, values):
+ if len(values) > len(fields):
+ raise ValueError("Can not create %s by %s" % (fields, values))
--- End diff --
Thanks, improve done and move this check to `__call__` in `Row`. eb3f506
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/22140#discussion_r215355538
--- Diff: python/pyspark/sql/tests.py ---
@@ -269,6 +269,10 @@ def test_struct_field_type_name(self):
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)
+ def test_invalid_create_row(slef):
--- End diff --
typo: `slef` -> `self`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
merged to master, branch 2.4 and 2.3. Thanks @xuanyuanking !
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22140: [SPARK-25072][PySpark] Forbid extra value for cus...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22140
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/22140
Thanks @BryanCutler @HyukjinKwon !
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2901/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22140
@BryanCutler, for https://github.com/apache/spark/pull/22140#issuecomment-414802978, yea, to me it looks less sense actually but seems at least working for now:
```python
from pyspark.sql import Row
rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
```
```
+---+
| c1|
+---+
| 1|
+---+
```
I think we should consider disallowing it in 3.0.0 given the test above.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2296/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94920/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
> Can we just simply take this out from branch-2.3?
Thanks @HyukjinKwon , that is fine with me. What do you think @gatorsmile ?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22140
We are very conservative when backporting the PR to the released version.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
@gatorsmile it seemed like a straightforward bug to me. Rows with extra values lead to incorrect output and exceptions when used in `DataFrames`, so it did not seem like there was any possible this would break existing code. For example
```
In [1]: MyRow = Row('a','b')
In [2]: print(MyRow(1,2,3))
Row(a=1, b=2)
In [3]: spark.createDataFrame([MyRow(1,2,3)])
Out[3]: DataFrame[a: bigint, b: bigint]
In [4]: spark.createDataFrame([MyRow(1,2,3)]).show()
18/09/08 21:55:48 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 3 values are provided.
In [5]: spark.createDataFrame([MyRow(1,2,3)], schema="x: int, y: int").show()
ValueError: Length of object (3) does not match with length of fields (2)
```
Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/22140
cc @HyukjinKwon
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22140
**[Test build #95756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95756/testReport)** for PR 22140 at commit [`eb3f506`](https://github.com/apache/spark/commit/eb3f506817e6cb99230853ffd5c50e3299527d4b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:
https://github.com/apache/spark/pull/22140
gental ping @HyukjinKwon @BryanCutler
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/22140
> yea, to me it looks less sense actually but seems at least working for now:
good point, I guess it only fails when you supply a schema.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22140
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org