Posted to issues@spark.apache.org by "Gengliang Wang (Jira)" <ji...@apache.org> on 2022/08/13 17:48:00 UTC

[jira] [Resolved] (SPARK-39926) Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans

     [ https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang resolved SPARK-39926.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37501
[https://github.com/apache/spark/pull/37501]

> Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-39926
>                 URL: https://issues.apache.org/jira/browse/SPARK-39926
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
>             Fix For: 3.4.0
>
>
> How to reproduce:
> {code:sql}
> set spark.sql.parquet.enableVectorizedReader=false;
> create table t(a int) using parquet;
> insert into t values (42);
> alter table t add column b int default 42;
> insert into t values (43, null);
> select * from t;
> {code}
> This should return two rows:
> (42, 42) and (43, NULL)
> But instead the scan misses the explicitly inserted NULL value and returns the existence DEFAULT value of 42 for it:
> (42, 42) and (43, 42).
>  
> This bug happens because the Parquet API calls one of these set* methods in ParquetRowConverter.scala whenever it finds a non-NULL value:
> {code:scala}
> private class RowUpdater(row: InternalRow, ordinal: Int)
>     extends ParentContainerUpdater {
>   override def set(value: Any): Unit = row(ordinal) = value
>   override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
>   override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
>   override def setShort(value: Short): Unit = row.setShort(ordinal, value)
>   override def setInt(value: Int): Unit = row.setInt(ordinal, value)
>   override def setLong(value: Long): Unit = row.setLong(ordinal, value)
>   override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
>   override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
> }
> {code}
>  
> But it never calls anything like "setNull()" when encountering a NULL value.
> To fix the bug, we need to know how many columns of data were present in each row of the Parquet data, so we can differentiate between a NULL value and a missing column.
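> A rough sketch of one way to do this (illustrative only, not the actual patch; all names below are hypothetical): have the row updater record which ordinals it wrote for the current record, and only apply the existence DEFAULT to ordinals the file never touched.
> {code:scala}
> import scala.collection.mutable
>
> // Hypothetical updater: records every ordinal it writes so that, after the
> // record ends, an explicitly stored NULL can be told apart from a column
> // that is absent from the Parquet file entirely.
> class TrackingRowUpdater(row: Array[Any], ordinal: Int, written: mutable.BitSet) {
>   private def mark(): Unit = written += ordinal
>   def set(value: Any): Unit = { mark(); row(ordinal) = value }
>   def setInt(value: Int): Unit = { mark(); row(ordinal) = value }
>   // ... the other primitive setters would mark the ordinal the same way
> }
>
> // After all converters for a record have run, fill in existence DEFAULTs
> // only for columns missing from the file, leaving stored NULLs alone.
> def applyExistenceDefaults(row: Array[Any], written: mutable.BitSet, defaults: Array[Any]): Unit = {
>   for (i <- row.indices if !written.contains(i)) {
>     row(i) = defaults(i)
>   }
> }
> {code}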


