You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/25 22:03:53 UTC

[GitHub] [spark] dtenedor opened a new pull request, #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

dtenedor opened a new pull request, #36672:
URL: https://github.com/apache/spark/pull/36672

   ### What changes were proposed in this pull request?
   
   Support vectorized Parquet scans when the table schema has associated DEFAULT column values.
   
   (Note, this PR depends on https://github.com/apache/spark/pull/36643 which adds the non-vectorized version.)
   
   Example:
   
   ```
   create table t(i int) using parquet;
   insert into t values(42);
   alter table t add column s string default concat('abc', def');
   select * from t;
   > 42, 'abcdef'
   ```
   
   ### Why are the changes needed?
   
   This change makes it easier to build, query, and maintain tables backed by Parquet data.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   This PR includes new test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1138800989

   Synced latest changes from master, this PR no longer depends on any other unmerged PRs anymore


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1139324091

   cc @sadikovi too FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on a diff in pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

sadikovi commented on code in PR #36672:
URL: https://github.com/apache/spark/pull/36672#discussion_r885096588


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java:
##########
@@ -270,13 +271,40 @@ private void initBatch(
         vectors[i + partitionIdx].setIsConstant();
       }
     }
+
+    // For Parquet tables whose columns have associated DEFAULT values, this reader must return
+    // those values instead of NULL when the corresponding columns are not present in storage (i.e.
+    // belong to the 'missingColumns' field in this class).
+    ColumnVector[] finalColumns = new ColumnVector[sparkSchema.fields().length];
+    for (int i = 0; i < columnVectors.length; i++) {
+      Object defaultValue = sparkRequestedSchema.existenceDefaultValues()[i];
+      if (defaultValue == null) {
+        finalColumns[i] = vectors[i];
+      } else {
+        WritableColumnVector writable;
+        if (memMode == MemoryMode.OFF_HEAP) {
+          writable = new OffHeapColumnVector(capacity, vectors[i].dataType());

Review Comment:
   Could you elaborate how `appendObjects` replaces null values? Also, should we still preserve ColumnVector here instead of creating a new one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1139648450

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1144408657

   The test log is a bit messy .. just copying and pasting the error I saw:
   
   ```
   2022-06-02T02:22:55.9442627Z [0m[[0m[0minfo[0m] [0m[0m[31m  org.apache.spark.sql.AnalysisException: Failed to execute CREATE TABLE command because the destination table column s has a DEFAULT value of badvalue which fails to resolve as a valid expression: [MISSING_COLUMN] Column 'badvalue' does not exist. Did you mean one of the following? []; line 1 pos 0[0m[0m
   2022-06-02T02:22:55.9443938Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.analyze(ResolveDefaultColumnsUtil.scala:141)[0m[0m
   2022-06-02T02:22:55.9445083Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$constantFoldCurrentDefaultsToExistDefaults$1(ResolveDefaultColumnsUtil.scala:96)[0m[0m
   2022-06-02T02:22:55.9449272Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)[0m[0m
   2022-06-02T02:22:55.9451745Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)[0m[0m
   2022-06-02T02:22:55.9452379Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)[0m[0m
   2022-06-02T02:22:55.9452963Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)[0m[0m
   2022-06-02T02:22:55.9453505Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.TraversableLike.map(TraversableLike.scala:286)[0m[0m
   2022-06-02T02:22:55.9454049Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)[0m[0m
   2022-06-02T02:22:55.9454583Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)[0m[0m
   2022-06-02T02:22:55.9494729Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.constantFoldCurrentDefaultsToExistDefaults(ResolveDefaultColumnsUtil.scala:94)[0m[0m
   2022-06-02T02:22:55.9495832Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:153)[0m[0m
   2022-06-02T02:22:55.9496639Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:148)[0m[0m
   2022-06-02T02:22:55.9497537Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)[0m[0m
   2022-06-02T02:22:55.9498263Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)[0m[0m
   2022-06-02T02:22:55.9499007Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)[0m[0m
   2022-06-02T02:22:55.9499830Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)[0m[0m
   2022-06-02T02:22:55.9500716Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)[0m[0m
   2022-06-02T02:22:55.9501658Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)[0m[0m
   2022-06-02T02:22:55.9502569Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:30)[0m[0m
   2022-06-02T02:22:55.9503469Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)[0m[0m
   2022-06-02T02:22:55.9504354Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)[0m[0m
   2022-06-02T02:22:55.9505333Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:30)[0m[0m
   2022-06-02T02:22:55.9506175Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:76)[0m[0m
   2022-06-02T02:22:55.9506968Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:75)[0m[0m
   2022-06-02T02:22:55.9507748Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:30)[0m[0m
   2022-06-02T02:22:55.9508709Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:148)[0m[0m
   2022-06-02T02:22:55.9509477Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.execution.datasources.DataSourceAnalysis.apply(DataSourceStrategy.scala:64)[0m[0m
   2022-06-02T02:22:55.9510186Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)[0m[0m
   2022-06-02T02:22:55.9510811Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)[0m[0m
   2022-06-02T02:22:55.9511420Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)[0m[0m
   2022-06-02T02:22:55.9511976Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.immutable.List.foldLeft(List.scala:91)[0m[0m
   2022-06-02T02:22:55.9512576Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)[0m[0m
   2022-06-02T02:22:55.9513240Z [0m[[0m[0minfo[0m] [0m[0m[31m  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)[0m[0m
   2022-06-02T02:22:55.9513896Z [0m[[0m[0minfo[0m] [0m[0m[31m  at scala.collection.immutable.List.foreach(List.scala:431)[0m[0m
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1137889083

   @gengliangwang and/or @HyukjinKwon this is DEFAULT column support for the vectorized Parquet reader, it depends on https://github.com/apache/spark/pull/36643 which adds the non-vectorized support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sadikovi commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

sadikovi commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1141516421

   Also, can we add a test to check that the DEFAULT values work? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1146161162

   @gengliangwang @HyukjinKwon @cloud-fan can someone please merge this in (or leave more review comment(s) if desired for another pass)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang closed pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

gengliangwang closed pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values
URL: https://github.com/apache/spark/pull/36672


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1145200667

   @HyukjinKwon the CI passes now :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36672:
URL: https://github.com/apache/spark/pull/36672#discussion_r886234949


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java:
##########
@@ -270,13 +271,40 @@ private void initBatch(
         vectors[i + partitionIdx].setIsConstant();
       }
     }
+
+    // For Parquet tables whose columns have associated DEFAULT values, this reader must return
+    // those values instead of NULL when the corresponding columns are not present in storage (i.e.
+    // belong to the 'missingColumns' field in this class).
+    ColumnVector[] finalColumns = new ColumnVector[sparkSchema.fields().length];
+    for (int i = 0; i < columnVectors.length; i++) {
+      Object defaultValue = sparkRequestedSchema.existenceDefaultValues()[i];
+      if (defaultValue == null) {
+        finalColumns[i] = vectors[i];
+      } else {
+        WritableColumnVector writable;
+        if (memMode == MemoryMode.OFF_HEAP) {
+          writable = new OffHeapColumnVector(capacity, vectors[i].dataType());

Review Comment:
   Sure, I added a comment explaining this (`appendObjects` delegates to other existing methods like `appendFloats`). And I made a change to reuse the existing ColumnVector instead of creating a new one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36672: [SPARK-39265][SQL] Support vectorized Parquet scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36672:
URL: https://github.com/apache/spark/pull/36672#issuecomment-1142910763

   > Also, can we add a test to check that the DEFAULT values work? Thanks.
   @sadikovi Sure, this is done in `InsertSuite`, by adding a new configuration to the case covering Parquet files (previously it only covered the non-vectorized case, but now with `Config(None)` it also runs the test over the vectorized case as well):
   
   ```
         TestCase(
           dataSource = "parquet",
           Seq(
             Config(
               None),
             Config(
               Some(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false"),
               insertNullsToStorage = false)))
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org