Posted to issues@hive.apache.org by "Ganesha Shreedhara (Jira)" <ji...@apache.org> on 2021/09/01 06:33:00 UTC

[jira] [Commented] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

    [ https://issues.apache.org/jira/browse/HIVE-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407866#comment-17407866 ] 

Ganesha Shreedhara commented on HIVE-25494:
-------------------------------------------

I verified that this issue doesn't exist when requestedSchema has only the field types that are present in the file schema. However, I noticed that VectorizedParquetRecordReader fetches all the fields and creates a VectorizedDummyColumnReader when a field is not present in the file schema. When columns are accessed by name, should we include in requestedSchema only the fields that are present in the file schema, and return null for the remaining selected columns that are missing from the file schema?
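To make the proposal concrete, here is a minimal stdlib-only sketch (plain Java, not the actual Hive/Parquet API; the names fileSchema, selectedColumns and readRow are hypothetical stand-ins) of the behaviour suggested above: only fields present in the file schema are requested from the reader, and the remaining selected columns come back as null.

```java
import java.util.*;

public class RequestedSchemaSketch {
    // Hypothetical stand-in for pruning the requested schema against the file
    // schema: keep only columns the file actually contains, and materialize
    // the rest as null instead of faking them as primitive types.
    static Map<String, String> readRow(Set<String> fileSchema,
                                       List<String> selectedColumns,
                                       Map<String, String> rowInFile) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String col : selectedColumns) {
            // Columns missing from the file schema are never sent to the
            // underlying reader; they are simply returned as null.
            result.put(col, fileSchema.contains(col) ? rowInFile.get(col) : null);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> fileSchema = new HashSet<>(Arrays.asList("parent.child", "toplevel"));
        Map<String, String> row = new HashMap<>();
        row.put("parent.child", "c1");
        row.put("toplevel", "t1");

        // "parent.extracol" exists in the table schema but not in the file schema.
        Map<String, String> out =
            readRow(fileSchema, Arrays.asList("parent.extracol", "toplevel"), row);
        System.out.println(out); // prints {parent.extracol=null, toplevel=t1}
    }
}
```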

 

> Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25494
>                 URL: https://issues.apache.org/jira/browse/HIVE-25494
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ganesha Shreedhara
>            Priority: Major
>         Attachments: test-struct.parquet
>
>
> When a struct type column's field is missing in the parquet file schema but present in the table schema, and columns are accessed by name, the requestedSchema sent from Hive to the Parquet storage layer has a type even for the missing field, because we always add the type as a primitive type if a field is missing in the file schema ([Ref|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]). On the Parquet side, this missing field gets pruned, and since the field belongs to a struct type, it ends up creating a GroupColumnIO without any children. This causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
>  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
>  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
>  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
>  ... 15 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
>  at java.util.ArrayList.get(ArrayList.java:433)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
>  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
>  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
>  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214) {code}
>  
> Steps to reproduce:
>  
> {code:java}
> CREATE TABLE parquet_struct_test(
> `parent` struct<child:string,extracol:string> COMMENT '',
> `toplevel` string COMMENT '')
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>  
> -- Use the attached test-struct.parquet data file to load data to this table
> LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> OK
> Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
> {code}
>  
> Same query works fine in the following scenarios:
> 1) Accessing parquet file columns by index instead of names
> {code:java}
> hive> set parquet.column.index.access=true;
> hive>  select parent.extracol, toplevel from parquet_struct_test;
> OK
> NULL toplevel{code}
>  
> 2) When VectorizedParquetRecordReader is used
> {code:java}
> hive> set hive.fetch.task.conversion=none;
> hive> select parent.extracol, toplevel from parquet_struct_test;
> Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
> ----------------------------------------------------------------------------------------------
> OK
> NULL toplevel{code}
>  
> 3) Create a copy of the same table and run the same query on the newly created table. 
> {code:java}
> hive> create table parquet_struct_test_copy like parquet_struct_test;
> OK
> hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
> Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
> ----------------------------------------------------------------------------------------------
> Loading data to table default.parquet_struct_test_copy
> OK
> hive> select parent.extracol, toplevel from parquet_struct_test_copy;
> OK
> NULL toplevel{code}
>  
> Also, this issue doesn't exist when only the missing struct type column's field is selected, or when all the fields in the table are selected. The issue occurs only when the combination of a missing struct type column's field and another existing column is selected.
>  

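For reference, the IndexOutOfBoundsException in the stack trace above arises because GroupColumnIO.getFirst() does the equivalent of get(0) on its children list, which is empty once the struct's only requested field has been pruned. A minimal stdlib-only illustration of that failure mode (plain Java; getFirst and the children list here are hypothetical stand-ins, not Parquet's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyGroupSketch {
    // Hypothetical stand-in for a GroupColumnIO whose only requested child
    // (parent.extracol) was pruned because it is absent from the file schema.
    static String getFirst(List<String> children) {
        // Equivalent of GroupColumnIO.getFirst(): takes the first child
        // without checking whether the list is empty.
        return children.get(0);
    }

    public static void main(String[] args) {
        List<String> prunedStructChildren = new ArrayList<>(); // no children left
        try {
            getFirst(prunedStructChildren);
        } catch (IndexOutOfBoundsException e) {
            // Same exception type as the "Index: 0, Size: 0" in the stack trace.
            System.out.println(e.getClass().getSimpleName());
        }
    }
}
```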


--
This message was sent by Atlassian Jira
(v8.3.4#803005)