Posted to dev@hive.apache.org by "Ganesha Shreedhara (Jira)" <ji...@apache.org> on 2021/09/01 06:20:00 UTC

[jira] [Created] (HIVE-25494) Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema

Ganesha Shreedhara created HIVE-25494:
-----------------------------------------

             Summary: Hive query fails with IndexOutOfBoundsException when a struct type column's field is missing in parquet file schema but present in table schema
                 Key: HIVE-25494
                 URL: https://issues.apache.org/jira/browse/HIVE-25494
             Project: Hive
          Issue Type: Bug
            Reporter: Ganesha Shreedhara
         Attachments: test-struct.parquet

When a struct type column's field is missing in the parquet file schema but present in the table schema, and columns are accessed by name, the requestedSchema sent from Hive to the Parquet storage layer contains a type even for the missing field, because Hive always adds a missing field to the requested schema as a primitive type ([Ref|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L130]). On the Parquet side, this missing field gets pruned, and since the field belongs to a struct type, Parquet ends up creating a GroupColumnIO without any children. This causes the query to fail with an IndexOutOfBoundsException; the stack trace is given below.
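The mechanism can be sketched with a minimal self-contained model (these are NOT the real Parquet classes, just a stand-in for what GroupColumnIO.getFirst() does, which per the stack trace is an ArrayList.get(0) on the group's child list):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Minimal model of the failure: GroupColumnIO.getFirst() calls children.get(0),
// which throws when the group lost its only requested child during schema pruning.
public class EmptyGroupIODemo {

    // Stand-in for GroupColumnIO's first-child access (hypothetical simplification).
    static String getFirst(List<String> children) {
        return children.get(0); // IndexOutOfBoundsException on an empty group
    }

    public static void main(String[] args) {
        // `parent` struct after Parquet pruned `extracol`, its only requested field
        List<String> prunedStruct = new ArrayList<>();
        try {
            getFirst(prunedStruct);
            System.out.println("no exception");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException, as in the stack trace");
        }
    }
}
{code}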

 
{code:java}
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file test-struct.parquet
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
 at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:98)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:695)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:333)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:459)
 ... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:657)
 at java.util.ArrayList.get(ArrayList.java:433)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
 at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
 at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
 at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
 at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
 at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
{code}
 

Steps to reproduce:

 
{code:java}
CREATE TABLE parquet_struct_test(
  `parent` struct<child:string,extracol:string> COMMENT '',
  `toplevel` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
 
-- Use the attached test-struct.parquet data file to load data to this table

LOAD DATA LOCAL INPATH 'test-struct.parquet' INTO TABLE parquet_struct_test;

hive> select parent.extracol, toplevel from parquet_struct_test;
OK
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://${host}/user/hive/warehouse/parquet_struct_test/test-struct.parquet 
{code}
 

The same query works fine in the following scenarios:


1) Accessing parquet file columns by index instead of names
{code:java}
hive> set parquet.column.index.access=true;
hive> select parent.extracol, toplevel from parquet_struct_test;
OK
NULL toplevel
{code}
 

2) When VectorizedParquetRecordReader is used
{code:java}
hive> set hive.fetch.task.conversion=none;
hive> select parent.extracol, toplevel from parquet_struct_test;
Query ID = hadoop_20210831154424_19aa6f7f-ab72-4c1e-ae37-4f985e72fce9
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.06 s
----------------------------------------------------------------------------------------------
OK
NULL toplevel
{code}
 

3) Create a copy of the same table and run the same query on the newly created table. 
{code:java}
hive> create table parquet_struct_test_copy like parquet_struct_test;
OK
hive> insert into parquet_struct_test_copy select * from parquet_struct_test;
Query ID = hadoop_20210831154709_954d0abf-d713-498e-8696-27fb9c457dc8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1630412697229_0031)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 3.81 s
----------------------------------------------------------------------------------------------
Loading data to table default.parquet_struct_test_copy
OK
hive> select parent.extracol, toplevel from parquet_struct_test_copy;
OK
NULL toplevel
{code}
 

Also, this issue doesn't occur when only the missing struct type column's field is selected, or when all the fields in the table are selected. The issue occurs only when a combination of the missing struct field and another existing column is selected.
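For concreteness, against the table above (the first query is the failing one from this report; the other two are the selections described as unaffected):

{code:java}
-- Fails: missing struct field combined with another existing column
select parent.extracol, toplevel from parquet_struct_test;

-- Works: only the missing struct field is selected
select parent.extracol from parquet_struct_test;

-- Works: all fields are selected
select * from parquet_struct_test;
{code}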

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)