Posted to issues@hive.apache.org by "Aditya Shah (JIRA)" <ji...@apache.org> on 2019/03/18 05:10:00 UTC
[jira] [Updated] (HIVE-17404) Orc split generation cache does not handle files without file tail

     [ https://issues.apache.org/jira/browse/HIVE-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Shah updated HIVE-17404:
-------------------------------
          Attachment: HIVE-17404.patch
    Target Version/s:   (was: 3.0.0, 2.4.0)
              Status: Patch Available  (was: Open)

I have submitted a patch that adds a check for the ORC bytes in the OrcTail before putting it in the local cache. The issue arises because HIVE-16133 minimized the tail data stored in the cache. As a result, using a cached entry triggers a call to extractFileTail, which rebuilds the OrcTail from the cached bytes. That rebuild runs the footer check, which throws an error. For old ORC files, when the tail is not present, we check the head of the file for the “ORC” text; but when only a tail is available, as in this call, the check throws an exception.
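
To illustrate the kind of check described above, here is a minimal, self-contained sketch. It is not the code in HIVE-17404.patch. It assumes the check amounts to looking for the three-byte "ORC" magic immediately before the final postscript-length byte of the serialized tail. The class name OrcTailMagicCheck and the method tailEndsWithOrcMagic are hypothetical and do not exist in Hive or ORC.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Hive or ORC.
public class OrcTailMagicCheck {
    private static final byte[] MAGIC = "ORC".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the readable bytes of the buffer end with "ORC"
     * followed by one byte (assumed here to be the postscript length).
     * The buffer's position and limit are left unchanged.
     */
    static boolean tailEndsWithOrcMagic(ByteBuffer tail) {
        int fullLength = MAGIC.length + 1; // magic plus the length byte
        if (tail == null || tail.remaining() < fullLength) {
            return false;
        }
        int start = tail.limit() - fullLength;
        for (int i = 0; i < MAGIC.length; i++) {
            // Absolute get: does not move the buffer's position.
            if (tail.get(start + i) != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer withMagic = ByteBuffer.wrap(new byte[] {'O', 'R', 'C', 4});
        ByteBuffer withoutMagic = ByteBuffer.wrap(new byte[] {1, 2, 3, 9});
        // Only a tail that carries the magic would be considered safe to cache.
        System.out.println("cache withMagic: " + tailEndsWithOrcMagic(withMagic));
        System.out.println("cache withoutMagic: " + tailEndsWithOrcMagic(withoutMagic));
    }
}
```

Under this assumption, a caller would skip the local cache for any tail that fails the check and read the footer directly from the file instead.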

cc [~prasanth_j] [~rajesh.balamohan] [~andrewom]

> Orc split generation cache does not handle files without file tail
> ------------------------------------------------------------------
>
>                 Key: HIVE-17404
>                 URL: https://issues.apache.org/jira/browse/HIVE-17404
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Aditya Shah
>            Priority: Critical
>         Attachments: HIVE-17404.patch
>
>
> Some old files do not have an ORC FileTail. If the file tail does not exist, split generation should fall back to the old way of storing footers.
> This can result in exceptions like the one below:
> {code}
> ORC split generation failed with exception: Malformed ORC file. Invalid postscript length 9
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1735)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1822)
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:450)
> 	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:569)
> 	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
> 	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
> 	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
> 	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
> 	at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.orc.FileFormatException: Malformed ORC file. Invalid postscript length 9
> 	at org.apache.orc.impl.ReaderImpl.ensureOrcFooter(ReaderImpl.java:297)
> 	at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:470)
> 	at org.apache.hadoop.hive.ql.io.orc.LocalCache.getAndValidate(LocalCache.java:103)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.getSplits(OrcInputFormat.java:804)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.runGetSplitsSync(OrcInputFormat.java:922)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$ETLSplitStrategy.generateSplitWork(OrcInputFormat.java:891)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.scheduleSplits(OrcInputFormat.java:1763)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1707)
> 	... 15 more
> {code}


