You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Jainik Vora (Jira)" <ji...@apache.org> on 2020/08/24 20:30:00 UTC

[jira] [Updated] (HIVE-24066) Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception

     [ https://issues.apache.org/jira/browse/HIVE-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jainik Vora updated HIVE-24066:
-------------------------------
    Priority: Minor  (was: Trivial)

> Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24066
>                 URL: https://issues.apache.org/jira/browse/HIVE-24066
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.5
>            Reporter: Jainik Vora
>            Priority: Minor
>
> I created a hive table containing columns with struct data type 
>  
> {code:java}
> CREATE EXTERNAL TABLE abc_dwh.table_on_parquet (
>   `context` struct<`app`:struct<`build`:string, `name`:string, `namespace`:string, `version`:string>, `screen`:struct<`height`:bigint, `width`:bigint>, `timezone`:string>,
>   `messageid` string,
>   `timestamp` string,
>   `userid` string)
> PARTITIONED BY (year string, month string, day string, hour string)
> STORED as PARQUET
> LOCATION 's3://abc/xyz';
>   {code}
>  
> All columns are nullable hence the parquet files read by the table don't always contain all columns. If any file in a partition doesn't have "context.app" struct and if "context.app.version" is queried, Hive throws an exception as below. Same for "context.screen" as well.
>  
> {code:java}
>  Caused by: java.io.IOException: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
> ... 25 more
> Caused by: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
> at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
> ... 26 more
>  {code}
>  
> Querying context.app shows as null
> {code:java}
> hive> select context.app from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
> OK
> NULL
> NULL
> NULL
> NULL
> NULL
>   {code}
>  
> As a workaround, I tried querying "context.app.version" only if "context.app" is not null but that also gave the same error.  *To verify the case statement for null check, I ran below query which should produce "0" in result for all columns produced "1".*  Distinct value of context.app for the partition is NULL so ruled out differences in select with limit. Running the same query in SparkSQL provides the correct result. 
> {code:java}
> hive> select case when context.app is null then 0 else 1 end status from abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and hour='03' limit 5;
> OK
> 1
> 1
> 1
> 1
> 1 {code}
> Hive Version used: 2.3.5-amzn-0 (on AWS EMR){color:#888888}
> {color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)