Posted to dev@hive.apache.org by "Pankit Thapar (JIRA)" <ji...@apache.org> on 2014/10/05 00:38:34 UTC

[jira] [Updated] (HIVE-8137) Empty ORC file handling

     [ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8137:
--------------------------------
    Attachment: HIVE-8137.patch

Current Logic
==============
CombineHiveInputFormat.getSplits() calls CombineFileInputFormatShim, which is a subclass of CombineFileInputFormat (in Hadoop).
CombineFileInputFormatShim calls CombineFileInputFormat.getSplits(), which creates splits without checking the file size. As a result, we
get CombineFileSplits that contain empty files.

Issue with the current logic
=============================
Empty files are not valid ORC files, since the format requires certain structures, such as the postscript, to be present in the file.
This causes an IndexOutOfBoundsException in the ORC reader, since it tries to access the postscript, which is not present in an empty file.
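
The failure can be reproduced in isolation with a plain ByteBuffer. This is only a sketch: the class and method names are made up for illustration, and it assumes the reader's tail buffer is allocated with readSize bytes, so an empty file gives a zero-capacity buffer and an index of -1.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class, not part of Hive or ORC.
class PostScriptLengthDemo {

    // Mirrors the line from the issue description: the last byte of the
    // tail buffer is interpreted as the (unsigned) PostScript length.
    static int postScriptLength(ByteBuffer buffer, int readSize) {
        return buffer.get(readSize - 1) & 0xff;
    }

    public static void main(String[] args) {
        // Non-empty tail: the last byte is read as an unsigned value.
        ByteBuffer tail = ByteBuffer.wrap(new byte[] {1, 2, 3, (byte) 200});
        System.out.println("psLen = " + postScriptLength(tail, 4));

        // Empty file: readSize is 0, so the absolute get() is asked for index -1.
        ByteBuffer empty = ByteBuffer.allocate(0);
        try {
            postScriptLength(empty, 0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("empty file: " + e.getClass().getSimpleName());
        }
    }
}
```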

Fix
====
1. Override listStatus() of FileInputFormat in CombineFileInputFormatShim, so that when CombineFileInputFormat.getSplits() calls listStatus(),
it ends up calling CombineFileInputFormatShim.listStatus(), which contains the logic for skipping empty files when creating splits.

2. Also, avoid creating empty file splits in OrcInputFormat.FileGenerator.
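
The filtering idea behind both fixes can be sketched without any Hadoop dependency. This is not the code from the patch: FileEntry is a hypothetical stand-in for Hadoop's FileStatus (only a path and a length), skipEmptyFiles is an illustrative name, and the paths are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class, not part of Hive or Hadoop.
class EmptyFileFilterSketch {

    // Stand-in for FileStatus: just the path and the length in bytes.
    static final class FileEntry {
        final String path;
        final long length;

        FileEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Keeps only files with at least one byte, so that no split is ever
    // built for a zero-length file.
    static List<FileEntry> skipEmptyFiles(List<FileEntry> listed) {
        List<FileEntry> nonEmpty = new ArrayList<>();
        for (FileEntry file : listed) {
            if (file.length > 0) {
                nonEmpty.add(file);
            }
        }
        return nonEmpty;
    }

    public static void main(String[] args) {
        List<FileEntry> listed = new ArrayList<>();
        listed.add(new FileEntry("/warehouse/t/000000_0", 4096L)); // invented path
        listed.add(new FileEntry("/warehouse/t/000001_0", 0L));    // empty file
        for (FileEntry file : skipEmptyFiles(listed)) {
            System.out.println(file.path);
        }
    }
}
```

In an overridden listStatus(), the same check would presumably be applied to the result of the parent's listing before it is returned.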

Testing
=======
Added two unit tests to cover the two fixes.


> Empty ORC file handling
> -----------------------
>
>                 Key: HIVE-8137
>                 URL: https://issues.apache.org/jira/browse/HIVE-8137
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.13.1
>            Reporter: Pankit Thapar
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8137.patch
>
>
> Hive 13 does not handle reading of a zero-size ORC file properly. An ORC file is supposed to have a postscript,
> which the ReaderImpl class tries to read in order to initialize the footer. But if the file is empty
> (zero size), it runs into an IndexOutOfBoundsException because ReaderImpl tries to read the postscript in its constructor.
> Code Snippet : 
> //get length of PostScript
> int psLen = buffer.get(readSize - 1) & 0xff; 
> In the above code, readSize for an empty file is zero.
> I see that the ensureOrcFooter() method performs some sanity checks on the footer,
> so either we can move the above code snippet to ensureOrcFooter() and throw a "Malformed ORC file" exception, or we can create a dummy Reader that does not initialize the footer and whose hasNext() returns false on the first call.
> Basically, I would like to know what the correct way to handle an empty ORC file in a mapred job might be.
> Should we ignore it and not throw an exception, or should we throw an exception saying that the ORC file is malformed?
> Please let me know your thoughts on this.
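
For discussion, the two options raised in the description might look roughly like this. It is a sketch under stated assumptions, not Hive code: RowReader, EmptyRowReader, and open() are hypothetical names, the strict flag is invented to show both behaviours side by side, and the real reader is passed in as a supplier so the example stays self-contained.

```java
import java.io.IOException;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical sketch class, not part of Hive or ORC.
class EmptyOrcHandlingSketch {

    // Minimal reader contract assumed for this sketch.
    interface RowReader {
        boolean hasNext();
        Object next();
    }

    // The "dummy reader" option: never touches the footer, reports no rows.
    static final class EmptyRowReader implements RowReader {
        @Override
        public boolean hasNext() {
            return false;
        }

        @Override
        public Object next() {
            throw new NoSuchElementException("empty file has no rows");
        }
    }

    // strict = true:  reject a zero-length file as malformed.
    // strict = false: treat a zero-length file as a file with no rows.
    static RowReader open(String path, long fileLength, boolean strict,
                          Supplier<RowReader> realReader) throws IOException {
        if (fileLength == 0) {
            if (strict) {
                throw new IOException("Malformed ORC file " + path + ": file is empty");
            }
            return new EmptyRowReader();
        }
        // Non-empty files go to the real reader, which parses the postscript.
        return realReader.get();
    }
}
```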


