Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/24 13:59:00 UTC
[jira] [Work logged] (HIVE-23758) OrcInputFormat.getSargColumnNames might be more failsafe in case of schema mismatch

     [ https://issues.apache.org/jira/browse/HIVE-23758?focusedWorklogId=450454&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-450454 ]

ASF GitHub Bot logged work on HIVE-23758:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Jun/20 13:58
            Start Date: 24/Jun/20 13:58
    Worklog Time Spent: 10m 
      Work Description: abstractdog opened a new pull request #1174:
URL: https://github.com/apache/hive/pull/1174


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-XXXXX: Fix a typo in YYY)
   For more details, please see https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 450454)
    Remaining Estimate: 0h
            Time Spent: 10m

> OrcInputFormat.getSargColumnNames might be more failsafe in case of schema mismatch
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-23758
>                 URL: https://issues.apache.org/jira/browse/HIVE-23758
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: orc_dump.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In a customer case, a bucket file was somehow placed into a partition directory that already contained another bucket file with a valid ACID schema (see [^orc_dump.log] for details), and the query failed during split generation with the error below at [this line|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L497]:
> {code}
> Caused by: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1871)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1959)
>         at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>         at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>         at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>         at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>         at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>         at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>         at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>         at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
>         ... 4 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865)
>         ... 17 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 6, Size: 6
>         at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>         at java.util.ArrayList.get(ArrayList.java:429)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSargColumnNames(OrcInputFormat.java:482)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:539)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.extractNeededColNames(OrcInputFormat.java:534)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.access$2900(OrcInputFormat.java:158)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1556)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2700(OrcInputFormat.java:1337)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1522)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1519)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1519)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1337)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> {code}
> I haven't figured out the origin of the second, empty file yet, but a sanity check on the length helped to get past this issue by ignoring that file during split generation. I'm about to try this out in the first version of the pull request.
> In the Tez app logs after the patch:
> {code}
> 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: possible schema mismatch, asked for column with index:6. column but there is only 6 types defined (isOriginal: false, originalColumnNames.length: 1), cannot get sarg col names...
> 2020-06-24 13:13:21,331 [WARN] [ORC_GET_SPLITS #2] |orc.OrcInputFormat|: Skipping split elimination for hdfs://ns1/warehouse/tablespace/managed/hive/bdaa28846/cda_date=20200601/cda_job_name=core_base/base_0000001/bucket_00001 as column names is null
> {code}
> where bucket_00001 was the second, problematic file, so the patch helped split generation recover from this strange state...
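
For illustration only, the kind of length sanity check described above could look roughly like the following sketch. It is not the code from the pull request: the class name, the `rootColumn` parameter, and the use of a plain `List<List<Integer>>` (each entry holding the child type ids of one type) in place of ORC's type metadata are all assumptions made to keep the example self-contained. The idea is simply to return null instead of indexing past the end of the type list, so that a caller can skip split elimination for that file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for the column-name lookup; not Hive's actual code.
final class SargColumnNamesSketch {

  /**
   * types.get(i) holds the child type ids of type i (an assumed simplification
   * of the ORC type list). Returns null when rootColumn is out of range, which
   * is treated here as a possible schema mismatch.
   */
  static String[] getSargColumnNames(String[] originalColumnNames,
      List<List<Integer>> types, boolean[] includedColumns, int rootColumn) {
    // Sanity check: without it, types.get(rootColumn) throws
    // IndexOutOfBoundsException when the file has fewer types than expected.
    if (rootColumn >= types.size()) {
      System.err.println("possible schema mismatch: asked for type index "
          + rootColumn + " but only " + types.size() + " types are defined");
      return null;
    }
    String[] columnNames = new String[types.size() - rootColumn];
    int i = 0;
    for (int columnId : types.get(rootColumn)) {
      if (includedColumns == null || includedColumns[columnId - rootColumn]) {
        columnNames[columnId - rootColumn] = originalColumnNames[i++];
      }
    }
    return columnNames;
  }

  public static void main(String[] args) {
    List<Integer> leaf = Collections.emptyList();

    // Six types only, but the row struct is expected at index 6: lookup is skipped.
    List<List<Integer>> broken =
        Arrays.asList(Arrays.asList(1, 2, 3, 4, 5), leaf, leaf, leaf, leaf, leaf);
    System.out.println(getSargColumnNames(new String[] {"a"}, broken, null, 6) == null);

    // Well-formed case: root struct at index 0 with two child columns.
    List<List<Integer>> ok = Arrays.asList(Arrays.asList(1, 2), leaf, leaf);
    System.out.println(
        Arrays.toString(getSargColumnNames(new String[] {"a", "b"}, ok, null, 0)));
  }
}
```

In this sketch a null result plays the role of the "column names is null" condition mentioned in the second log line; how the real caller reacts to it is determined by the patch, not by this example.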



--
This message was sent by Atlassian Jira
(v8.3.4#803005)