You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/11 14:57:39 UTC

[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19474

    [SPARK-22252][SQL] FileFormatWriter should respect the input query schema

    ## What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`).
    
    However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "<A: int>" to `<a: int>`. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path.
    
    Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands.
    
    So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again.
     
    ## How was this patch tested?
    
    new regression test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19474
    
----
commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-10-11T14:30:38Z

    FileFormatWriter should respect the input query schema

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144180666
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
         job.setOutputValueClass(classOf[InternalRow])
         FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
     
    -    val allColumns = plan.output
    +    val allColumns = queryExecution.logical.output
    --- End diff --
    
    I think it'd be good to leave a comment that we should not use optimized output here in case it will be changed in the future.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    cc @gatorsmile @viirya


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82661/testReport)** for PR 19474 at commit [`5bdaf7d`](https://github.com/apache/spark/commit/5bdaf7d88475c64ca6d119e913eba73b1ac8e2e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    > The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.
    
    Totally agreed. LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait Command extends LeafNode `
      * `trait RunnableCommand extends Command `
      * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    LGTM pending Jenkins.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Minor comments. LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    For a simple command `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`, the UI before this PR:
    <img width="1017" alt="p1" src="https://user-images.githubusercontent.com/3182036/31452520-bc74bb44-aee1-11e7-8721-234925856411.png">
    
    The UI after this PR:
    <img width="798" alt="p2" src="https://user-images.githubusercontent.com/3182036/31452534-c6ba5622-aee1-11e7-865d-b2af359d529d.png">
    
    The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    I like this change because the relation between `ExecutedCommandExec` and `RunnableCommand` is a little entangled.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82640/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144198270
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
         job.setOutputValueClass(classOf[InternalRow])
         FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
     
    -    val allColumns = plan.output
    +    val allColumns = queryExecution.logical.output
    --- End diff --
    
    Explicitly using `analyzed`'s schema is better here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144197934
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala ---
    @@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
      */
     trait DataWritingCommand extends RunnableCommand {
     
    +  def query: LogicalPlan
    +
    +  // We make the input `query` an inner child instead of a child in order to hide it from the
    +  // optimizer. This is because optimizer may change the output schema names, and we have to keep
    --- End diff --
    
    You will scare others. :) 
    
    -> `may not preserve the output schema names' case`



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    thanks for review, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait Command extends LeafNode `
      * `trait RunnableCommand extends Command `
      * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144180788
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
         job.setOutputValueClass(classOf[InternalRow])
         FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
     
    -    val allColumns = plan.output
    +    val allColumns = queryExecution.logical.output
    --- End diff --
    
    Btw, shall we use `queryExecution.analyzed.output`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19474


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144198505
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala ---
    @@ -30,6 +31,15 @@ import org.apache.spark.util.SerializableConfiguration
      */
     trait DataWritingCommand extends RunnableCommand {
     
    +  def query: LogicalPlan
    --- End diff --
    
    Add one line description for `query`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82661/testReport)** for PR 19474 at commit [`5bdaf7d`](https://github.com/apache/spark/commit/5bdaf7d88475c64ca6d119e913eba73b1ac8e2e9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144197368
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala ---
    @@ -30,4 +31,12 @@ class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
           assert(partFiles.length === 2)
         }
       }
    +
    +  test("FileFormatWriter should respect the input query schema") {
    +    withTable("t1", "t2") {
    +      spark.range(1).select('id as 'col1, 'id as 'col2).write.saveAsTable("t1")
    --- End diff --
    
    Also add another case here?
    ```
    spark.range(1).select('id, 'id as 'col1, 'id as 'col2).write.saveAsTable("t3")
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82639/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144210692
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -117,7 +117,9 @@ object FileFormatWriter extends Logging {
         job.setOutputValueClass(classOf[InternalRow])
         FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
     
    -    val allColumns = plan.output
    +    // Pick the attributes from analyzed plan, as optimizer may not preserve the output schema
    +    // names' case.
    +    val allColumns = queryExecution.analyzed.output
         val partitionSet = AttributeSet(partitionColumns)
    --- End diff --
    
    You might need to double check the `partitionColumns ` in all the other files are also from analyzed plans. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82638/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82661/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait Command extends LeafNode `
      * `trait RunnableCommand extends Command `
      * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org