You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/12 21:18:06 UTC

[GitHub] [spark] dtenedor opened a new pull request, #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

dtenedor opened a new pull request, #37501:
URL: https://github.com/apache/spark/pull/37501

   ### What changes were proposed in this pull request?
   
   Fix a bug in column DEFAULT support for non-vectorized Parquet scans, where inserting explicit NULL values to a column with a DEFAULT and then selecting the column back would sometimes erroneously return the default value. 
   
   To exercise the behavior:
   
   ```
   set spark.sql.parquet.enableVectorizedReader=false;
   create table t(a int) using parquet;
   insert into t values (42);
   alter table t add column b int default 42;
   insert into t values (43, null);
   select * from t;
   ```
   
   This should return two rows:
   
   `(42, 42) and (43, NULL)`
   
   But instead the scan missed the inserted NULL value, and returned the existence DEFAULT value of "42" instead:
   
   `(42, 42) and (43, 42)`.
   
   After this bug fix, Spark now returns the former correct result.
   
   ### Why are the changes needed?
   
   This fixes the correctness of SQL queries using Spark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The PR includes unit test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #37501:
URL: https://github.com/apache/spark/pull/37501#issuecomment-1213569970

   Hi @gengliangwang thank you for walking through the logic, it helps to move the code out of the loop. This is ready when you are again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on PR #37501:
URL: https://github.com/apache/spark/pull/37501#issuecomment-1214196592

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #37501:
URL: https://github.com/apache/spark/pull/37501#issuecomment-1213528860

   Hi @gengliangwang here is the fix for the column DEFAULT non-vectorized Parquet bug. Looks like we can simplify along the way as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #37501:
URL: https://github.com/apache/spark/pull/37501#discussion_r944887434


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala:
##########
@@ -282,14 +239,17 @@ private[parquet] class ParquetRowConverter(
       // Create a RowUpdater instance for converting Parquet objects to Catalyst rows. If any fields
       // in the Catalyst result schema have associated existence default values, maintain a boolean
       // array to track which fields have been explicitly assigned for each row.
-      val rowUpdater: RowUpdater =
-        if (catalystType.hasExistenceDefaultValues) {
-          resetExistenceDefaultsBitmask(catalystType)
-          new RowUpdaterWithBitmask(
-            currentRow, catalystFieldIndex, catalystType.existenceDefaultsBitmask)
-        } else {
-          new RowUpdater(currentRow, catalystFieldIndex)
+      val rowUpdater: RowUpdater = new RowUpdater(currentRow, catalystFieldIndex)
+      if (catalystType.hasExistenceDefaultValues) {
+        for (i <- 0 until catalystType.existenceDefaultValues.size) {
+          catalystType.existenceDefaultsBitmask(i) =
+            if (i < parquetType.getFieldCount) {

Review Comment:
   Why comparing with `parquetType.getFieldCount` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #37501:
URL: https://github.com/apache/spark/pull/37501#issuecomment-1213638831

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #37501:
URL: https://github.com/apache/spark/pull/37501#discussion_r944913477


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala:
##########
@@ -282,14 +239,17 @@ private[parquet] class ParquetRowConverter(
       // Create a RowUpdater instance for converting Parquet objects to Catalyst rows. If any fields
       // in the Catalyst result schema have associated existence default values, maintain a boolean
       // array to track which fields have been explicitly assigned for each row.
-      val rowUpdater: RowUpdater =
-        if (catalystType.hasExistenceDefaultValues) {
-          resetExistenceDefaultsBitmask(catalystType)
-          new RowUpdaterWithBitmask(
-            currentRow, catalystFieldIndex, catalystType.existenceDefaultsBitmask)
-        } else {
-          new RowUpdater(currentRow, catalystFieldIndex)
+      val rowUpdater: RowUpdater = new RowUpdater(currentRow, catalystFieldIndex)
+      if (catalystType.hasExistenceDefaultValues) {
+        for (i <- 0 until catalystType.existenceDefaultValues.size) {
+          catalystType.existenceDefaultsBitmask(i) =
+            if (i < parquetType.getFieldCount) {

Review Comment:
   We discussed this offline. I moved this code out of the `parquetType.getFields.asScala.map { parquetField => ...` loop, and also ported the explanation into a comment here:
   
   ```
   // Assume the schema for a Parquet file-based table contains N fields. Then if we later
   // run a command "ALTER TABLE t ADD COLUMN c DEFAULT <value>" on the Parquet table, this
   // adds one field to the Catalyst schema. Then if we query the old files with the new
   // Catalyst schema, we should only apply the existence default value to all columns > N.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang closed pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans

Posted by GitBox <gi...@apache.org>.

gengliangwang closed pull request #37501: [SPARK-39926][SQL] Fix bug in column DEFAULT support for non-vectorized Parquet scans
URL: https://github.com/apache/spark/pull/37501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org