You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/10/26 03:10:07 UTC

[GitHub] spark pull request #19579: [SPARK-22356][SQL] data source table should suppo...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19579

    [SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema

    ## What changes were proposed in this pull request?
    
    This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring schema again at read path. However, there is one special case: overlapped columns between data and partition. For this case, it breaks the assumption of table schema that there is on ovelap between data and partition schema, and partition columns should be at the end. The result is, for Spark 2.1, the table scan has incorrect schema that puts partition columns at the end. For Spark 2.2, we add a check in CatalogTable to validate table schema, which fails at this case.
    
    To fix this issue, a simple and safe approach is to fallback to old behavior when overlapeed columns detected, i.e. store empty schema in metastore.
    
    ## How was this patch tested?
    
    new regression test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark bug2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19579
    
----
commit 18907cb2359efb9b4e874482916de04af9cf90a2
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-10-26T01:26:39Z

    overlapped columns between data and partition schema in data source tables

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    Thanks! Merged to ...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19579: [SPARK-22356][SQL] data source table should suppo...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19579#discussion_r147264509
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
    @@ -85,14 +86,28 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
           }
         }
     
    -    val newTable = table.copy(
    -      schema = dataSource.schema,
    -      partitionColumnNames = partitionColumnNames,
    -      // If metastore partition management for file source tables is enabled, we start off with
    -      // partition provider hive, but no partitions in the metastore. The user has to call
    -      // `msck repair table` to populate the table partitions.
    -      tracksPartitionsInCatalog = partitionColumnNames.nonEmpty &&
    -        sessionState.conf.manageFilesourcePartitions)
    +    val newTable = dataSource match {
    +      // Since Spark 2.1, we store the inferred schema of data source in metastore, to avoid
    +      // inferring the schema again at read path. However if the data source has overlapped columns
    +      // between data and partition schema, we can't store it in metastore as it breaks the
    +      // assumption of table schema. Here we fallback to the behavior of Spark prior to 2.1, store
    +      // empty schema in metastore and infer it at runtime. Note that this also means the new
    +      // scalable partitioning handling feature(introduced at Spark 2.1) is disabled in this case.
    +      case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
    +        table.copy(schema = new StructType(), partitionColumnNames = Nil)
    --- End diff --
    
    Log a warning message here? When data columns and partition columns have the same names, the values could be inconsistent. Thus, we do not suggest users to create such a table and it might perform well because we infer the schema at runtime. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    **[Test build #83094 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83094/testReport)** for PR 19579 at commit [`2fe7ec4`](https://github.com/apache/spark/commit/2fe7ec42c308a08da0f3bd064a75aa95060fff7b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    cc @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    **[Test build #83094 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83094/testReport)** for PR 19579 at commit [`2fe7ec4`](https://github.com/apache/spark/commit/2fe7ec42c308a08da0f3bd064a75aa95060fff7b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    **[Test build #83073 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83073/testReport)** for PR 19579 at commit [`18907cb`](https://github.com/apache/spark/commit/18907cb2359efb9b4e874482916de04af9cf90a2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    **[Test build #83073 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83073/testReport)** for PR 19579 at commit [`18907cb`](https://github.com/apache/spark/commit/18907cb2359efb9b4e874482916de04af9cf90a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83094/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83073/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19579: [SPARK-22356][SQL] data source table should suppo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19579


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19579: [SPARK-22356][SQL] data source table should support over...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19579
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org