You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Shivam Sharma (Jira)" <ji...@apache.org> on 2021/05/06 14:00:07 UTC
[jira] [Comment Edited] (HIVE-24066) Hive query on parquet data
should identify if column is not present in file schema and show NULL value
instead of Exception
[ https://issues.apache.org/jira/browse/HIVE-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340225#comment-17340225 ]
Shivam Sharma edited comment on HIVE-24066 at 5/6/21, 1:59 PM:
---------------------------------------------------------------
I am also facing the same issue. Can we get an update here?
Can we get some workarounds here other than creating a new table?
was (Author: shivamsharma):
I am also facing the same issue. Can we get update here?
> Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-24066
> URL: https://issues.apache.org/jira/browse/HIVE-24066
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.3.5, 3.1.2
> Reporter: Jainik Vora
> Priority: Major
> Attachments: day_01.snappy.parquet
>
>
> I created a hive table containing columns with struct data type
>
> {code:java}
> CREATE EXTERNAL TABLE test_dwh.sample_parquet_table (
> `context` struct<
> `app`: struct<
> `build`: string,
> `name`: string,
> `namespace`: string,
> `version`: string
> >,
> `device`: struct<
> `adtrackingenabled`: boolean,
> `advertisingid`: string,
> `id`: string,
> `manufacturer`: string,
> `model`: string,
> `type`: string
> >,
> `locale`: string,
> `library`: struct<
> `name`: string,
> `version`: string
> >,
> `os`: struct<
> `name`: string,
> `version`: string
> >,
> `screen`: struct<
> `height`: bigint,
> `width`: bigint
> >,
> `network`: struct<
> `carrier`: string,
> `cellular`: boolean,
> `wifi`: boolean
> >,
> `timezone`: string,
> `userAgent`: string
> >
> ) PARTITIONED BY (day string)
> STORED as PARQUET
> LOCATION 's3://xyz/events'{code}
>
> All columns are nullable hence the parquet files read by the table don't always contain all columns. If any file in a partition doesn't have "context.os" struct and if "context.os.name" is queried, Hive throws an exception as below. Same for "context.screen" as well.
>
> {code:java}
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with exception java.io.IOException:java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with exception java.io.IOException:java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]java.io.IOException: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379)
> at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459) ... 16 more{code}
>
> Querying context.os shows as null
> {code:java}
> hive> select context.os from test_dwh.sample_parquet_table where day='01' limit 5;
> OK
> NULL
> NULL
> NULL
> NULL
> NULL
> {code}
>
> As a workaround, I tried querying "context.os.name" only if "context.os" is not null but that also gave the same error. *To verify the case statement for null check, I ran below query which should produce "0" in result for all columns produced "1"*. Distinct value of context.os for the partition is NULL so ruled out differences in select with limit. Running the same query in SparkSQL provides the correct result.
> {code:java}
> hive> select case when context.os is null then 0 else 1 end status from test_dwh.sample_parquet_table where year=2020 and month='07' and day=26 and hour='03' limit 5;
> OK
> 1
> 1
> 1
> 1
> 1 {code}
> Able to reproduce this on Hive 2.x and 3.x - we tested on Hive-2.3.5, Hive-3.1.4 (HDP cluster).
>
> To reproduce,
> 1. Copy the attached day_01.snappy.parquet to an S3 location "s3://<bucket_name>/events/day=01"
> 2. Create the table {{test_dwh.sample_parquet_table}}
> 3. Run "{{msck repair table test_dwh.sample_parquet_table"}}
> 4. Run this query "{{select context.os.name from test_dwh.sample_parquet_table;"}}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)