Posted to issues@hive.apache.org by "Shivam Sharma (Jira)" <ji...@apache.org> on 2021/05/06 13:59:00 UTC

[jira] [Commented] (HIVE-24066) Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception

    [ https://issues.apache.org/jira/browse/HIVE-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340225#comment-17340225 ] 

Shivam Sharma commented on HIVE-24066:
--------------------------------------

I am also facing the same issue. Can we get an update here?

> Hive query on parquet data should identify if column is not present in file schema and show NULL value instead of Exception
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24066
>                 URL: https://issues.apache.org/jira/browse/HIVE-24066
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.5, 3.1.2
>            Reporter: Jainik Vora
>            Priority: Major
>         Attachments: day_01.snappy.parquet
>
>
> I created a Hive table containing columns with a struct data type:
>   
> {code:java}
> CREATE EXTERNAL TABLE test_dwh.sample_parquet_table (
>   `context` struct<
>         `app`: struct<
>             `build`: string,
>             `name`: string,
>             `namespace`: string,
>             `version`: string
>             >,
>         `device`: struct<
>             `adtrackingenabled`: boolean,
>             `advertisingid`: string,
>             `id`: string,
>             `manufacturer`: string,
>             `model`: string,
>             `type`: string
>             >,
>         `locale`: string,
>         `library`: struct<
>             `name`: string,
>             `version`: string
>             >,
>         `os`: struct<
>             `name`: string,
>             `version`: string
>             >,
>         `screen`: struct<
>             `height`: bigint,
>             `width`: bigint
>             >,
>         `network`: struct<
>             `carrier`: string,
>             `cellular`: boolean,
>             `wifi`: boolean
>              >,
>         `timezone`: string,
>         `userAgent`: string
>     >
> ) PARTITIONED BY (day string)
> STORED as PARQUET
> LOCATION 's3://xyz/events'{code}
>  
>  All columns are nullable, hence the Parquet files read by the table don't always contain all columns. If any file in a partition doesn't have the "context.os" struct and "context.os.name" is queried, Hive throws the exception below. The same happens for "context.screen".
>   
> {code:java}
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with exception java.io.IOException:java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]
> 2020-10-23T00:44:10,496 ERROR [db58bfe6-d0ca-4233-845a-8a10916c3ff1 main([])]: CliDriver (SessionState.java:printError(1126)) - Failed with exception java.io.IOException:java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name]java.io.IOException: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name] 
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
>   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
>   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
>   at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>   at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
>   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: java.lang.RuntimeException: Primitive type osshould not doesn't match typeos[name] 
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:330) 
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.projectLeafTypes(DataWritableReadSupport.java:322) 
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getProjectedSchema(DataWritableReadSupport.java:249)
>   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:379) 
>   at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
>   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
>   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>   at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>   at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459) ... 16 more{code}
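>
>  To confirm that the failing files really are missing the struct, the Parquet footer can be dumped and compared against the table schema. A minimal sketch, assuming parquet-hadoop is on the classpath and taking any file under the affected partition (e.g. the attached day_01.snappy.parquet) as its argument:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.parquet.hadoop.ParquetFileReader;
> import org.apache.parquet.hadoop.util.HadoopInputFile;
> import org.apache.parquet.schema.MessageType;
> 
> public class PrintParquetSchema {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path path = new Path(args[0]); // a file under the affected partition
>     try (ParquetFileReader reader =
>              ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
>       // Prints the schema stored in the file footer; in an affected file the
>       // "context" group has no "os" (or "screen") field even though the Hive
>       // table declares them.
>       MessageType schema = reader.getFileMetaData().getSchema();
>       System.out.println(schema);
>     }
>   }
> }
> {code}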
>  
>  Querying context.os by itself shows NULL:
> {code:java}
> hive> select context.os from test_dwh.sample_parquet_table where day='01' limit 5;
> OK
> NULL
> NULL
> NULL
> NULL
> NULL
>   {code}
>  
>  As a workaround, I tried querying "context.os.name" only when "context.os" is not null, but that gave the same error. *To verify the null check in the CASE statement, I ran the query below; it should return "0" for every row, yet it returned "1" for all of them.* The distinct value of context.os for the partition is NULL, which rules out differences caused by SELECT with LIMIT. Running the same query in Spark SQL returns the correct result.
> {code:java}
> hive> select case when context.os is null then 0 else 1 end status from test_dwh.sample_parquet_table where year=2020 and month='07' and day=26 and hour='03' limit 5;
> OK
> 1
> 1
> 1
> 1
> 1 {code}
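>
>  A minimal sketch of how the same check can be run through Spark SQL (assuming a Spark build with Hive support and access to the same metastore); Spark reads the missing struct as NULL and returns "0" for every row:
> {code:java}
> import org.apache.spark.sql.SparkSession;
> 
> public class SparkNullCheck {
>   public static void main(String[] args) {
>     // Requires spark-hive on the classpath and a reachable Hive metastore.
>     SparkSession spark = SparkSession.builder()
>         .appName("HIVE-24066-null-check")
>         .enableHiveSupport()
>         .getOrCreate();
> 
>     // Same null check as above; here the missing context.os struct is read as NULL.
>     spark.sql("select case when context.os is null then 0 else 1 end status "
>         + "from test_dwh.sample_parquet_table where day='01' limit 5").show();
> 
>     spark.stop();
>   }
> }
> {code}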
> This is reproducible on Hive 2.x and 3.x; we tested on Hive 2.3.5 and Hive 3.1.4 (HDP cluster).
>  
> To reproduce:
>  1. Copy the attached day_01.snappy.parquet to an S3 location "s3://<bucket_name>/events/day=01"
>  2. Create the table {{test_dwh.sample_parquet_table}}
>  3. Run "{{msck repair table test_dwh.sample_parquet_table}}"
>  4. Run this query: "{{select context.os.name from test_dwh.sample_parquet_table;}}"
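>
>  In other words, the ask is that the Parquet schema projection tolerate requested columns that are absent from a file and surface them as NULL instead of throwing. A rough illustration of that idea using parquet-mr schema types (this is only a sketch of the intended behaviour, not the actual DataWritableReadSupport code):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.parquet.schema.GroupType;
> import org.apache.parquet.schema.Type;
> 
> public final class LenientProjection {
>   // Keep only the requested fields that actually exist in the file schema,
>   // recursing into nested structs; anything missing is dropped so the reader
>   // can emit NULL for it instead of failing.
>   public static List<Type> project(GroupType fileSchema, GroupType requested) {
>     List<Type> kept = new ArrayList<>();
>     for (Type wanted : requested.getFields()) {
>       if (!fileSchema.containsField(wanted.getName())) {
>         continue; // column not in this file, e.g. context.os -> read as NULL
>       }
>       Type inFile = fileSchema.getType(wanted.getName());
>       if (wanted.isPrimitive() || inFile.isPrimitive()) {
>         kept.add(inFile); // leaf column (mismatch handling omitted for brevity)
>       } else {
>         // Nested struct such as context.os or context.screen.
>         List<Type> children = project(inFile.asGroupType(), wanted.asGroupType());
>         kept.add(inFile.asGroupType().withNewFields(children));
>       }
>     }
>     return kept;
>   }
> }
> {code}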



--
This message was sent by Atlassian Jira
(v8.3.4#803005)