Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/09 07:08:43 UTC

[GitHub] [spark] sadikovi commented on a diff in pull request #37419: [SPARK-39833][SQL] Remove partition columns from data schema in the case of overlapping columns to fix Parquet DSv1 incorrect count issue

sadikovi commented on code in PR #37419:
URL: https://github.com/apache/spark/pull/37419#discussion_r940976365


##########
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##########
@@ -2777,18 +2777,24 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
     }
   }
 
-  test("SPARK-22356: overlapped columns between data and partition schema in data source tables") {
+  test("SPARK-39833: overlapped columns between data and partition schema in data source tables") {
+    // SPARK-39833 changed behaviour of the column order in the case of overlapping columns between
+    // data and partition schemas: data schema does not include partition columns anymore and the
+    // overlapping columns would appear at the end of the schema together with other partition
+    // columns.
     withTempPath { path =>
       Seq((1, 1, 1), (1, 2, 1)).toDF("i", "p", "j")
         .write.mode("overwrite").parquet(new File(path, "p=1").getCanonicalPath)
       withTable("t") {
         sql(s"create table t using parquet options(path='${path.getCanonicalPath}')")
-        // We should respect the column order in data schema.
-        assert(spark.table("t").columns === Array("i", "p", "j"))
+        // MSCK command is required now to update partitions in the catalog.
+        sql(s"msck repair table t")
+
+        assert(spark.table("t").columns === Array("i", "j", "p"))
         checkAnswer(spark.table("t"), Row(1, 1, 1) :: Row(1, 1, 1) :: Nil)
         // The DESC TABLE should report same schema as table scan.
         assert(sql("desc t").select("col_name")
-          .as[String].collect().mkString(",").contains("i,p,j"))
+          .as[String].collect().mkString(",").contains("i,j,p"))

Review Comment:
   Partition columns are always appended to the schema. In the case of overlapping columns, we now remove all of the partition columns from the data schema and append them afterwards. The query results are unchanged, but the column order in the output changes.
   
   Essentially:
   data schema: `i, p, j`; partition schema: `p`. We remove `p` from the data schema and then append the partition columns, yielding `i, j, p`.
   
   Previously we would keep the partition column as part of the data schema and insert partition values into it, which IMHO is a bit confusing. This change also makes DSv1 consistent with DSv2, which already orders columns this way.
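   The reordering described above can be sketched in plain Scala. This is a simplified illustration of the idea only (the object and method names are made up for this comment, not Spark's actual internals), and it matches columns by name alone, ignoring case sensitivity and nested types:
   
   ```scala
   // Hypothetical sketch: drop partition columns that overlap with the data
   // schema, then append the full partition schema at the end.
   object SchemaOrderSketch {
     def resolveSchema(dataSchema: Seq[String], partitionSchema: Seq[String]): Seq[String] =
       dataSchema.filterNot(partitionSchema.contains) ++ partitionSchema
   
     def main(args: Array[String]): Unit = {
       // data schema: i, p, j; partition schema: p  =>  i, j, p
       println(resolveSchema(Seq("i", "p", "j"), Seq("p")).mkString(","))
     }
   }
   ```
   
   With no overlap the data schema is untouched and the partition columns are still appended at the end, so the non-overlapping case behaves as before.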
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

