Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2016/04/12 20:23:25 UTC

[jira] [Created] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema

Cheng Lian created SPARK-14566:
----------------------------------

             Summary: When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema
                 Key: SPARK-14566
                 URL: https://issues.apache.org/jira/browse/SPARK-14566
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Cheng Lian
            Assignee: Cheng Lian


Take the following snippets, slightly modified from the test case "SQLQuerySuite.SPARK-11453: append data to partitioned table", as an example:

{code}
val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j")
df1.write.partitionBy("i").saveAsTable("tbl11453")

val df2 = Seq("3" -> "30").toDF("i", "j")
df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453")
{code}

Although {{df1.schema}} is {{<i:STRING, j:STRING>}}, the schema of the persisted table {{tbl11453}} is actually {{<j:STRING, i:STRING>}}, because {{i}} is a partition column and partition columns are always appended after all data columns. Thus, when appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} actually differ.
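
For illustration, the reordering can be observed directly from the persisted table (a minimal sketch, assuming a Spark 2.0 {{SparkSession}} named {{spark}}; this check is not part of the original test case):

{code}
// Hypothetical check: the persisted table lists the data column "j"
// before the partition column "i", even though df1 declared <i, j>.
spark.table("tbl11453").schema.fieldNames  // expected: Array(j, i)
{code}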

In the current master branch, {{CreateMetastoreDataSourceAsSelect}} simply applies the existing metastore schema to the input query plan ([see here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]), which is wrong. A projection should be used instead to adjust the column order.
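
A minimal sketch of what such a projection could look like, assuming the metastore schema is available as a {{StructType}} and the input query as a {{LogicalPlan}} ({{adjustColumnOrder}} is an illustrative name, not existing Spark code):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.types.StructType

// Reorder the query's output columns to match the column order recorded in
// the metastore, instead of overwriting the query schema with it.
def adjustColumnOrder(query: LogicalPlan, metastoreSchema: StructType): LogicalPlan = {
  val projectList = metastoreSchema.map { field =>
    query.output
      .find(_.name == field.name)  // simplified: ignores case-insensitive resolution
      .getOrElse(sys.error(s"Column ${field.name} not found in input query"))
  }
  Project(projectList, query)
}
{code}

Applied in {{CreateMetastoreDataSourceAsSelect}}, this would make the append case write columns in the same order as the existing table, regardless of the column order of the input DataFrame.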

In branch-1.6, [this projection was added in {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104], but it was removed in Spark 2.0. Replacing the aforementioned line in {{CreateMetastoreDataSourceAsSelect}} with a projection would be preferable.



