Posted to user@hive.apache.org by Jainik Vora <ja...@gmail.com> on 2020/07/28 23:46:15 UTC

Hive on Parquet - Unexpected/incorrect behavior for struct columns not present in Parquet file

I created a Hive table containing columns with a struct data type:

CREATE EXTERNAL TABLE abc_dwh.table_on_parquet (
  `context` struct<
      `app`:struct<`build`:string, `name`:string, `namespace`:string, `version`:string>,
      `screen`:struct<`height`:bigint, `width`:bigint>,
      `timezone`:string>,
  `messageid` string,
  `timestamp` string,
  `userid` string)
PARTITIONED BY (year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://abc/xyz';

All columns are nullable, so the Parquet files read by the table don't
always contain every column. If any file in a partition doesn't have the
"context.app" struct and "context.app.version" is queried, I get the
exception below. The same happens for "context.screen".
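
For example, a query along these lines (same partition and limit as in the
examples further down in this mail) triggers it:

select context.app.version
from abc_dwh.table_on_parquet
where year=2020 and month='07' and day=26 and hour='03'
limit 5;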

Caused by: java.io.IOException: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    ... 25 more
Caused by: java.lang.RuntimeException: Primitive type appshould not doesn't match typeapp[version]
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
    at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
    ... 26 more

Querying context.app returns NULL for every row:
hive> select context.app from abc_dwh.table_on_parquet where year=2020 and
month='07' and day=26 and hour='03' limit 5;
OK
NULL
NULL
NULL
NULL
NULL

As a workaround, I tried querying "context.app.version" only when
"context.app" is not null, but that gave the same error (a sketch of that
attempt is shown after the results below).  *To verify the CASE statement's
null check, I ran the query below: it should return "0" for every row, yet
every row returned "1".*  The only distinct value of context.app in this
partition is NULL, which rules out the LIMIT clause picking different rows.
Running the same query in Spark SQL gives the correct result.
hive> select case when context.app is null then 0 else 1 end status from
abc_dwh.table_on_parquet where year=2020 and month='07' and day=26 and
hour='03' limit 5;
OK
1
1
1
1
1
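
For reference, the workaround attempt was along these lines (a sketch; the
null check wrapped the nested field access):

-- only dereference context.app.version when context.app is present
select case when context.app is not null then context.app.version end as app_version
from abc_dwh.table_on_parquet
where year=2020 and month='07' and day=26 and hour='03'
limit 5;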

*Is there a way to query a nullable struct column reliably?*
I'd appreciate any help on this.

Hive Version used: 2.3.5-amzn-0 (on AWS EMR)

-- 


Thanks & Regards,

*Jainik Vora*