Posted to commits@spark.apache.org by do...@apache.org on 2022/07/12 16:20:43 UTC

[spark] branch master updated: [SPARK-39706][SQL] Set missing column with defaultValue as constant in `ParquetColumnVector`

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new e0d4ef4b0bd [SPARK-39706][SQL] Set missing column with defaultValue as constant in `ParquetColumnVector`
e0d4ef4b0bd is described below

commit e0d4ef4b0bd2c8641b830106b0cb6063351ad5da
Author: yangjie01 <ya...@baidu.com>
AuthorDate: Tue Jul 12 09:20:24 2022 -0700

    [SPARK-39706][SQL] Set missing column with defaultValue as constant in `ParquetColumnVector`
    
    ### What changes were proposed in this pull request?
    This PR adds a call to `vector.setIsConstant()` during `ParquetColumnVector` initialization when a missing column has a default value and `vector.appendObjects(capacity, defaultValue).isPresent()` is true.
    
    ### Why are the changes needed?
    This is a minor improvement. For a missing column with a default value, setting `isConstant` to true prevents the `reset()` method from resetting the internal state of the `WritableColumnVector`. `OrcColumnarBatchReader` already does something similar for missing columns.
    
    https://github.com/apache/spark/blob/bb4c4778713c7ba1ee92d0bb0763d7d3ce54374f/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java#L178-L191
    
    Without this change there is no bug: a missing column is initialized only once and its corresponding columnReader is null, so `reset()` only resets `WritableColumnVector#elementsAppended` to 0, which does not affect anything.
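    
    The interaction between `setIsConstant()` and `reset()` described above can be illustrated with a toy model. This is a minimal sketch, not Spark code: `ToyVector`, `appendInts`, and `ResetSketch` are hypothetical names, and the only behaviour modelled is the assumption that `reset()` returns early for a constant vector and otherwise sets `elementsAppended` back to 0.
    
    ```java
    // Toy stand-in for a writable column vector (hypothetical, not Spark's class).
    final class ToyVector {
        private final int[] data;
        private int elementsAppended = 0;
        private boolean isConstant = false;
    
        ToyVector(int capacity) {
            this.data = new int[capacity];
        }
    
        // Appends `count` copies of `value`, similar in spirit to filling a default value.
        void appendInts(int count, int value) {
            for (int i = 0; i < count; i++) {
                data[elementsAppended++] = value;
            }
        }
    
        // Marks the vector as constant so that reset() leaves it untouched.
        void setIsConstant() {
            isConstant = true;
        }
    
        // Assumed behaviour: a constant vector is skipped, otherwise the append cursor is rewound.
        void reset() {
            if (isConstant) {
                return;
            }
            elementsAppended = 0;
        }
    
        int getElementsAppended() {
            return elementsAppended;
        }
    
        int getInt(int rowId) {
            return data[rowId];
        }
    }
    
    final class ResetSketch {
        public static void main(String[] args) {
            int capacity = 4;
    
            // Missing column filled with a default value, then marked constant.
            ToyVector constant = new ToyVector(capacity);
            constant.appendInts(capacity, 7);
            constant.setIsConstant();
            constant.reset();
            System.out.println("constant: elementsAppended=" + constant.getElementsAppended());
    
            // Same fill without the constant flag: reset() rewinds the cursor,
            // but the stored values are still readable.
            ToyVector plain = new ToyVector(capacity);
            plain.appendInts(capacity, 7);
            plain.reset();
            System.out.println("plain: elementsAppended=" + plain.getElementsAppended()
                + ", value at row 0=" + plain.getInt(0));
        }
    }
    ```
    
    In this model the stored default values stay readable in both cases, which matches the statement that the change is an improvement rather than a bug fix.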
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    Closes #37115 from LuciferYang/setIsConstant.
    
    Lead-authored-by: yangjie01 <ya...@baidu.com>
    Co-authored-by: YangJie <ya...@baidu.com>
    Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
 .../spark/sql/execution/datasources/parquet/ParquetColumnVector.java    | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java
index 2ad8cdfcca6..47774e0a397 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java
@@ -89,6 +89,8 @@ final class ParquetColumnVector {
         throw new IllegalArgumentException("Cannot assign default column value to result " +
           "column batch in vectorized Parquet reader because the data type is not supported: " +
           defaultValue);
+      } else {
+        vector.setIsConstant();
       }
     }
 

