Posted to issues@hive.apache.org by "Sourabh Badhya (Jira)" <ji...@apache.org> on 2023/01/21 04:58:00 UTC
[jira] [Comment Edited] (HIVE-26755) Wrong results after renaming Parquet column

    [ https://issues.apache.org/jira/browse/HIVE-26755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679395#comment-17679395 ] 

Sourabh Badhya edited comment on HIVE-26755 at 1/21/23 4:57 AM:
----------------------------------------------------------------

The SELECT query returns correct results when the following config is set:
{code:java}
set parquet.column.index.access=true{code}
This ensures that columns are read by their positional index in the schema, provided that the table schema and the file schema keep their columns in the same order (which is the case here).

I wonder whether this can be solved deterministically without relying on the positional index. According to the code, when the positional index is not used, the schema is resolved by column name:
[https://github.com/apache/hive/blob/0e61afc181c5d8c41b04525bd34726edbe3fd0d7/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L114]

As the code shows, if a table column name is not part of the file schema, the table schema's column name is used as-is in the schema passed to the ParquetRecordReader that fetches the rows. Because the file schema has no column with that name, the reader returns NULL for it.
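
To make the difference between the two lookup modes concrete, here is a minimal toy sketch. It is not Hive or Parquet code: the class {{ColumnResolutionSketch}} and the methods {{readByName}} and {{readByIndex}} are hypothetical names invented for illustration, and a "file" is modeled as nothing more than a list of column names plus one row of values taken from the issue description.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (hypothetical names, not the Hive implementation).
final class ColumnResolutionSketch {

    // Name-based lookup: search the FILE schema for the table column name.
    // A renamed column is not found there, so the value comes back as null.
    static Object readByName(List<String> fileSchema, Object[] row, String tableColumn) {
        int idx = fileSchema.indexOf(tableColumn);
        return idx < 0 ? null : row[idx];
    }

    // Index-based lookup: use the column's position in the TABLE schema.
    // This only works if both schemas keep their columns in the same order.
    static Object readByIndex(List<String> tableSchema, Object[] row, String tableColumn) {
        int idx = tableSchema.indexOf(tableColumn);
        return (idx < 0 || idx >= row.length) ? null : row[idx];
    }

    public static void main(String[] args) {
        // The file was written before the rename, the table schema after it.
        List<String> fileSchema = Arrays.asList("id", "fname", "lname", "age");
        List<String> tableSchema = Arrays.asList("id", "fname", "lname", "years_from_birth");
        Object[] row = {2, "Alex", "Dumas", 38};

        System.out.println(readByName(fileSchema, row, "years_from_birth"));   // prints null
        System.out.println(readByIndex(tableSchema, row, "years_from_birth")); // prints 38
    }
}
```

In this simplified model the name-based path loses the value as soon as the names diverge, while the positional path keeps working only as long as the column order is unchanged.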

cc [~zabetak] 


> Wrong results after renaming Parquet column
> -------------------------------------------
>
>                 Key: HIVE-26755
>                 URL: https://issues.apache.org/jira/browse/HIVE-26755
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Parquet
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Stamatis Zampetakis
>            Priority: Major
>
> Renaming the column of a Parquet table leads to wrong results when the query uses the renamed column.
> {code:sql}
> create table person (id int, fname string, lname string, age int) stored as parquet;
> insert into person values (1, 'Victor', 'Hugo', 23);
> insert into person values (2, 'Alex', 'Dumas', 38);
> insert into person values (3, 'Marco', 'Pollo', 25);
> select fname from person where age >=25;
> {code}
> ||Correct results||
> |Alex|
> |Marco|
> {code:sql}
> alter table person change column age years_from_birth int;
> select fname from person where years_from_birth >=25;
> {code}
> After renaming the column, the query above returns an empty result set.
> {code:sql}
> select years_from_birth from person;
> {code}
> ||Wrong results||
> |NULL|
> |NULL|
> |NULL|
> After renaming the column, the query returns the correct number of rows, but every value is NULL.
> The problem is reproducible on current master (commit ae0cabffeaf284a6d2ec13a6993c87770818fbb9).


