Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2015/06/02 14:40:51 UTC

[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/6583

    [SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append

    The current code references the schema of the DataFrame to be written before checking the save mode. This triggers expensive metadata discovery prematurely. For save modes other than `Append`, this metadata discovery is useless, since we later either ignore the result (for `Ignore` and `ErrorIfExists`) or delete the existing files (for `Overwrite`).
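
The ordering issue described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `Mode`, `Relation`, `SaveSketch`, and `planWrite` are stand-ins invented for illustration and are not Spark's actual classes or the code in this patch. The only assumption carried over from the description is that reading the relation's schema is the expensive step, so the sketch models it as a `lazy val` and touches it only on the `Append` path.

```scala
// Hypothetical stand-ins for illustration; these are NOT Spark's internals.
sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode
case object Ignore extends Mode

final class Relation(discover: () => Seq[String]) {
  // The (assumed expensive) discovery runs only on first access to `schema`.
  lazy val schema: Seq[String] = discover()
}

object SaveSketch {
  // Returns the column order to write with, or None when the write is skipped.
  // The save mode is inspected first, so `relation.schema` is only forced
  // when appending to existing data.
  def planWrite(
      mode: Mode,
      pathExists: Boolean,
      relation: Relation,
      dataColumns: Seq[String]): Option[Seq[String]] =
    mode match {
      case ErrorIfExists if pathExists =>
        throw new IllegalStateException("path already exists")
      case Ignore if pathExists => None
      case Append if pathExists => Some(relation.schema)
      // Overwrite, or any mode on a fresh path: existing metadata is irrelevant.
      case _ => Some(dataColumns)
    }
}
```

With this shape, an `Overwrite` or `Ignore` call never invokes the `discover` function passed to `Relation`, which is the behavior the description asks for.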

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-8014

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6583.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6583
    
----
commit 8fbd93fe485d9d212944f340933235226e67ee82
Author: Cheng Lian <li...@databricks.com>
Date:   2015-06-02T12:35:26Z

    Fixes SPARK-8014

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6583#issuecomment-107940853
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6583#discussion_r31517920
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---
    @@ -453,6 +456,20 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
           }
         }
       }
    +
    +  test("SPARK-7616: adjust column name order accordingly when saving partitioned table") {
    +    val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
    +
    +    df.write
    +      .format(dataSourceName)
    +      .mode(SaveMode.Overwrite)
    +      .partitionBy("c", "a")
    +      .saveAsTable("t")
    +
    +    withTable("t") {
    +      checkAnswer(table("t"), df.select('b, 'c, 'a).collect())
    +    }
    +  }
    --- End diff --
    
    Moved this test case here so that it gets executed for all test suites that extend `HadoopFsRelationTest`.
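
As a rough illustration of why a test placed in an abstract base suite runs for every concrete suite, here is a minimal sketch in plain Scala. It deliberately does not use ScalaTest or Spark's test harness: `RelationTestBase`, `JsonSuite`, `ParquetSuite`, and the `test` registration helper are all hypothetical names made up for this example.

```scala
// Hypothetical miniature of the pattern; not ScalaTest, not Spark's harness.
abstract class RelationTestBase {
  // Each concrete suite supplies its own data source name.
  def dataSourceName: String

  private var registered = Vector.empty[(String, () => Unit)]

  // Registers a named test body without running it.
  protected def test(name: String)(body: => Unit): Unit =
    registered = registered :+ ((name, () => body))

  def testNames: Seq[String] = registered.map(_._1)

  // Runs every registered test body in registration order.
  def runAll(): Unit = registered.foreach { case (_, body) => body() }

  // Declared once in the base class, so every subclass inherits it.
  test("shared: column order") {
    assert(dataSourceName.nonEmpty)
  }
}

class JsonSuite extends RelationTestBase {
  def dataSourceName: String = "json"
  test("json only") {}
}

class ParquetSuite extends RelationTestBase {
  def dataSourceName: String = "parquet"
}
```

Both `JsonSuite` and `ParquetSuite` end up with the shared test registered, while a test declared in one subclass stays local to it.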




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6583#issuecomment-107976638
  
      [Test build #33981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33981/consoleFull) for PR 6583 at commit [`8fbd93f`](https://github.com/apache/spark/commit/8fbd93fe485d9d212944f340933235226e67ee82).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6583#issuecomment-107976657
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/6583#issuecomment-107940921
  
    Merged build started.




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6583#discussion_r31517707
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala ---
    @@ -322,19 +322,10 @@ private[sql] object ResolvedDataSource {
               Some(partitionColumnsSchema(data.schema, partitionColumns)),
               caseInsensitiveOptions)
     
    -        // For partitioned relation r, r.schema's column ordering is different with the column
    -        // ordering of data.logicalPlan. We need a Project to adjust the ordering.
    -        // So, inside InsertIntoHadoopFsRelation, we can safely apply the schema of r.schema to
    -        // the data.
    -        val project =
    -          Project(
    -            r.schema.map(field => new UnresolvedAttribute(Seq(field.name))),
    --- End diff --
    
    This `r.schema` is where metadata discovery is triggered.




[GitHub] spark pull request: [SPARK-8014] [SQL] Avoid premature metadata di...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/6583#issuecomment-107941655
  
      [Test build #33981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33981/consoleFull) for PR 6583 at commit [`8fbd93f`](https://github.com/apache/spark/commit/8fbd93fe485d9d212944f340933235226e67ee82).

