You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/11 14:57:39 UTC
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19474
[SPARK-22252][SQL] FileFormatWriter should respect the input query schema
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`).
However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "<A: int>" to `<a: int>`. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path.
Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands.
So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again.
## How was this patch tested?
new regression test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark bug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19474.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19474
----
commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc
Author: Wenchen Fan <we...@databricks.com>
Date: 2017-10-11T14:30:38Z
FileFormatWriter should respect the input query schema
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144180666
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
- val allColumns = plan.output
+ val allColumns = queryExecution.logical.output
--- End diff --
I think it'd be good to leave a comment that we should not use optimized output here in case it will be changed in the future.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19474
cc @gatorsmile @viirya
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82661/testReport)** for PR 19474 at commit [`5bdaf7d`](https://github.com/apache/spark/commit/5bdaf7d88475c64ca6d119e913eba73b1ac8e2e9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19474
> The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.
Totally agreed. LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait Command extends LeafNode `
* `trait RunnableCommand extends Command `
* `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19474
LGTM pending Jenkins.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19474
Minor comments. LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19474
For a simple command `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`, the UI before this PR:
<img width="1017" alt="p1" src="https://user-images.githubusercontent.com/3182036/31452520-bc74bb44-aee1-11e7-8721-234925856411.png">
The UI after this PR:
<img width="798" alt="p2" src="https://user-images.githubusercontent.com/3182036/31452534-c6ba5622-aee1-11e7-865d-b2af359d529d.png">
The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19474
I like this change because the relation between `ExecutedCommandExec` and `RunnableCommand` is a little entangled.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82640/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144198270
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
- val allColumns = plan.output
+ val allColumns = queryExecution.logical.output
--- End diff --
Explicitly using `analyzed`'s schema is better here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144197934
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala ---
@@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
*/
trait DataWritingCommand extends RunnableCommand {
+ def query: LogicalPlan
+
+ // We make the input `query` an inner child instead of a child in order to hide it from the
+ // optimizer. This is because optimizer may change the output schema names, and we have to keep
--- End diff --
You will scare others. :)
-> `may not preserve the output schema names' case`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19474
thanks for review, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait Command extends LeafNode `
* `trait RunnableCommand extends Command `
* `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144180788
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
- val allColumns = plan.output
+ val allColumns = queryExecution.logical.output
--- End diff --
Btw, shall we use `queryExecution.analyzed.output`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19474
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144198505
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala ---
@@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
*/
trait DataWritingCommand extends RunnableCommand {
+ def query: LogicalPlan
--- End diff --
Add one line description for `query`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82661/testReport)** for PR 19474 at commit [`5bdaf7d`](https://github.com/apache/spark/commit/5bdaf7d88475c64ca6d119e913eba73b1ac8e2e9).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144197368
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala ---
@@ -30,4 +31,12 @@ class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
assert(partFiles.length === 2)
}
}
+
+ test("FileFormatWriter should respect the input query schema") {
+ withTable("t1", "t2") {
+ spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1")
--- End diff --
Also add another case here?
```
spark.range(1).select('id, 'id as 'col1, 'id as 'col2).write.saveAsTable("t3")
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82639/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19474#discussion_r144210692
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
@@ -117,7 +117,9 @@ object FileFormatWriter extends Logging {
job.setOutputValueClass(classOf[InternalRow])
FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
- val allColumns = plan.output
+ // Pick the attributes from analyzed plan, as optimizer may not preserve the output schema
+ // names' case.
+ val allColumns = queryExecution.analyzed.output
val partitionSet = AttributeSet(partitionColumns)
--- End diff --
You might need to double check the `partitionColumns ` in all the other files are also from analyzed plans.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82638/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19474
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82661/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19474
**[Test build #82638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `trait Command extends LeafNode `
* `trait RunnableCommand extends Command `
* `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org