Posted to issues@hive.apache.org by "Venugopal Reddy K (Jira)" <ji...@apache.org> on 2022/12/15 14:47:00 UTC

[jira] [Updated] (HIVE-26862) IndexOutOfBoundsException occurred in stats task during dynamic partition table load when user data for partition column is case sensitive. And few rows are missed in the partition as well.

     [ https://issues.apache.org/jira/browse/HIVE-26862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venugopal Reddy K updated HIVE-26862:
-------------------------------------
    Description: 
*[Description]* 

A java.lang.IndexOutOfBoundsException occurs in the stats task during a dynamic partition table load. This happens when the data for the partition column contains values that differ only in case (e.g. "vegetable" vs "Vegetable"). In addition, some rows are silently dropped from the affected partition.
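Judging from the stack trace below (HMSHandler.updatePartColumnStatsWithMerge), the failure is consistent with partition names being merged case-insensitively at one layer while the per-partition column-stats list remains case-sensitive, leaving the two collections with different sizes. The following is a speculative, minimal Python sketch of that shape only, not Hive code; all names in it are illustrative:

```python
# Hypothetical model of the suspected mismatch: 3 case-sensitive partition
# values collapse to 2 partitions under a case-insensitive merge, so walking
# both lists by a shared index overruns the shorter one at index 2 -- the
# same shape as "IndexOutOfBoundsException: Index: 2, Size: 2".

def merge_case_insensitive(values):
    """Keep only the first value per case-folded key."""
    seen, merged = set(), []
    for v in values:
        key = v.lower()
        if key not in seen:
            seen.add(key)
            merged.append(v)
    return merged

stats_entries = ["Fruit", "vegetable", "Vegetable"]   # 3 distinct values
partitions = merge_case_insensitive(stats_entries)    # only 2 survive

for i, _ in enumerate(stats_entries):
    if i >= len(partitions):
        print(f"Index: {i}, Size: {len(partitions)}")  # Index: 2, Size: 2
        break
```

If that is indeed the mechanism, it would also be consistent with the missing (7,tomato) row observed in the warehouse listing below.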

*[Steps to reproduce]*

1. Create a stage table, load data into it, create a partitioned table, and insert into the partitioned table from the stage table. The data file is attached to this issue.
{code:java}
0: jdbc:hive2://localhost:10000> create database mydb;
0: jdbc:hive2://localhost:10000> use mydb;
{code}
{code:java}
0: jdbc:hive2://localhost:10000> create table stage(num int, name string, category string) row format delimited fields terminated by ',' stored as textfile;
{code}
{code:java}
0: jdbc:hive2://localhost:10000> load data local inpath 'data' into table stage;{code}
{code:java}
0: jdbc:hive2://localhost:10000> select * from stage;
+------------+-------------+---------------+
| stage.num  | stage.name  | stage.category|
+------------+-------------+---------------+
| 1          | apple       | Fruit         |
| 2          | banana      | Fruit         |
| 3          | carrot      | vegetable     |
| 4          | cherry      | Fruit         |
| 5          | potato      | vegetable     |
| 6          | mango       | Fruit         |
| 7          | tomato      | Vegetable     | => the 'V' in Vegetable is uppercase here
+------------+-------------+---------------+
7 rows selected (12.979 seconds)
{code}
{code:java}
0: jdbc:hive2://localhost:10000> create table dynpart(num int, name string) partitioned by (category string) row format delimited fields terminated by ',' stored as textfile;{code}
{code:java}
0: jdbc:hive2://localhost:10000> insert into dynpart select * from stage;
INFO  : Compiling command(queryId=kvenureddy_20221215192112_ae2e55b5-6b1f-402d-b79f-874261a27b72): insert into dynpart select * from stage
INFO  : No Stats for mydb@stage, Columns: num, name, category
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:stage.num, type:int, comment:null), FieldSchema(name:stage.name, type:string, comment:null), FieldSchema(name:stage.category, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=kvenureddy_20221215192112_ae2e55b5-6b1f-402d-b79f-874261a27b72); Time taken: 2.967 seconds
INFO  : Operation QUERY obtained 0 locks
INFO  : Executing command(queryId=kvenureddy_20221215192112_ae2e55b5-6b1f-402d-b79f-874261a27b72): insert into dynpart select * from stage
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez) or using Hive 1.X releases.
INFO  : Query ID = kvenureddy_20221215192112_ae2e55b5-6b1f-402d-b79f-874261a27b72
INFO  : Total jobs = 2
INFO  : Launching Job 1 out of 2
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_local729224564_0001
INFO  : Executing with tokens: []
INFO  : The url to track the job: http://localhost:8080/
INFO  : Job running in-process (local Hadoop)
INFO  : 2022-12-15 19:21:27,285 Stage-1 map = 0%,  reduce = 0%
INFO  : 2022-12-15 19:21:28,321 Stage-1 map = 100%,  reduce = 0%
INFO  : 2022-12-15 19:21:29,359 Stage-1 map = 100%,  reduce = 100%
INFO  : Ended Job = job_local729224564_0001
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table mydb.dynpart partition (category=null) from file:/tmp/warehouse/external/mydb.db/dynpart/.hive-staging_hive_2022-12-15_19-21-12_997_3457134057632526413-1/-ext-10000
INFO  : 


INFO  : 	 Time taken to load dynamic partitions: 33.657 seconds
INFO  : 	 Time taken for adding to write entity : 0.003 seconds
INFO  : Launching Job 2 out of 2
INFO  : Starting task [Stage-3:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_local1246165356_0002
INFO  : Executing with tokens: []
INFO  : The url to track the job: http://localhost:8080/
INFO  : Job running in-process (local Hadoop)
INFO  : 2022-12-15 19:22:13,511 Stage-3 map = 100%,  reduce = 100%
INFO  : Ended Job = job_local1246165356_0002
INFO  : Starting task [Stage-2:STATS] in serial mode
INFO  : Executing stats task
INFO  : Partition {category=Fruit} stats: [numFiles=1, numRows=4, totalSize=34, rawDataSize=30, numFilesErasureCoded=0]
INFO  : Partition {category=Vegetable} stats: [numFiles=1, numRows=1, totalSize=18, rawDataSize=8, numFilesErasureCoded=0]
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask. java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
INFO  : Stage-Stage-3:  HDFS Read: 0 HDFS Write: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 0 msec
INFO  : Completed executing command(queryId=kvenureddy_20221215192112_ae2e55b5-6b1f-402d-b79f-874261a27b72); Time taken: 452.037 seconds
Error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask. java.lang.IndexOutOfBoundsException: Index: 2, Size: 2 (state=08S01,code=1){code}
 

2. Inspect the dynpart table path under the warehouse directory:
{code:java}
kvenureddy@192 dynpart % pwd
/tmp/warehouse/external/mydb.db/dynpart
kvenureddy@192 dynpart % ls
category=Fruit category=Vegetable
kvenureddy@192 dynpart % cd category=Vegetable 
kvenureddy@192 category=Vegetable % ls
000000_0
kvenureddy@192 category=Vegetable % cat 000000_0 
5,potato
3,carrot => only 2 rows present; row (7,tomato) is missing from this partition.
kvenureddy@192 category=Vegetable % cd ..
kvenureddy@192 dynpart % ls
category=Fruit category=Vegetable
kvenureddy@192 dynpart % cd category=Fruit 
kvenureddy@192 category=Fruit % ls
000000_0
kvenureddy@192 category=Fruit % cat 000000_0 
6,mango
4,cherry
2,banana
1,apple
kvenureddy@192 category=Fruit % 
{code}
 

*[Exception Info]* 

Complete log file is attached.
{code:java}
2022-12-15T19:28:48,003 ERROR [HiveServer2-Background-Pool: Thread-123] metastore.RetryingHMSHandler: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.hadoop.hive.metastore.HMSHandler.updatePartColumnStatsWithMerge(HMSHandler.java:9194)
    at org.apache.hadoop.hive.metastore.HMSHandler.set_aggr_stats_for(HMSHandler.java:9149)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:146)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at com.sun.proxy.$Proxy31.set_aggr_stats_for(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.setPartitionColumnStatistics(HiveMetaStoreClient.java:3307)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.setPartitionColumnStatistics(SessionHiveMetaStoreClient.java:566)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:218)
    at com.sun.proxy.$Proxy32.setPartitionColumnStatistics(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.setPartitionColumnStatistics(Hive.java:5677)
    at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:221)
    at org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:94)
    at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
    at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:354)
    at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:327)
    at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:244)
    at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:105)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:370)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:205)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
    at org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:90)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:340)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:360)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2022-12-15T19:28:48,004 ERROR [HiveServer2-Background-Pool: Thread-123] exec.StatsTask: Failed to run stats task
{code}

> IndexOutOfBoundsException occurred in stats task during dynamic partition table load when user data for partition column is case sensitive. And few rows are missed in the partition as well.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26862
>                 URL: https://issues.apache.org/jira/browse/HIVE-26862
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Venugopal Reddy K
>            Priority: Major
>         Attachments: data, hive.log
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)