You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/11 14:57:39 UTC

[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19474

    [SPARK-22252][SQL] FileFormatWriter should respect the input query schema

    ## What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child, when we do the actual writing, we just pass the physical plan to the writer(`FileFormatWriter.write`).
    
    However this is problematic. In Spark SQL, optimizer and planner are allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from "<A: int>" to `<a: int>`. When it comes to writing, especially for self-description data format like parquet, we may write the wrong schema to the file and cause null values at the read path.
    
    Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to allow nested execution and one query can map to multiple executions in the UI. This releases the major restriction in #18604 , and now we don't have to take the input `query` as child of `InsertIntoXXX` commands.
    
    So the fix is simple, this PR partially revert #18064 and make `InsertIntoXXX` commands leaf nodes again.
     
    ## How was this patch tested?
    
    new regression test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19474
    
----
commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-10-11T14:30:38Z

    FileFormatWriter should respect the input query schema

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19474#discussion_r144180666
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala ---
    @@ -117,7 +117,7 @@ object FileFormatWriter extends Logging {
         job.setOutputValueClass(classOf[InternalRow])
         FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
     
    -    val allColumns = plan.output
    +    val allColumns = queryExecution.logical.output
    --- End diff --
    
    I think it'd be good to leave a comment that we should not use optimized output here in case it will be changed in the future.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)** for PR 19474 at commit [`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    cc @gatorsmile @viirya


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82661/testReport)** for PR 19474 at commit [`5bdaf7d`](https://github.com/apache/spark/commit/5bdaf7d88475c64ca6d119e913eba73b1ac8e2e9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    > The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.
    
    Totally agreed. LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)** for PR 19474 at commit [`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    **[Test build #82639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)** for PR 19474 at commit [`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `trait Command extends LeafNode `
      * `trait RunnableCommand extends Command `
      * `case class ExecutedCommandExec(cmd: RunnableCommand) extends LeafExecNode `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    LGTM pending Jenkins.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Minor comments. LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    For a simple command `Seq(1 -> "a").toDF("i", "j").write.parquet("/tmp/qwe")`, the UI before this PR:
    <img width="1017" alt="p1" src="https://user-images.githubusercontent.com/3182036/31452520-bc74bb44-aee1-11e7-8721-234925856411.png">
    
    The UI after this PR:
    <img width="798" alt="p2" src="https://user-images.githubusercontent.com/3182036/31452534-c6ba5622-aee1-11e7-865d-b2af359d529d.png">
    
    The scan node is no longer visible above the insert node, I'll fix this later. The writer bug is more important and we should fix it ASAP.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    I like this change because the relation between `ExecutedCommandExec` and `RunnableCommand` is a little entangled.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19474
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82640/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org