Posted to common-dev@hadoop.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2008/11/03 10:26:44 UTC

[jira] Commented: (HADOOP-1824) want InputFormat for zip files

    [ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644669#action_12644669 ] 

Ankur commented on HADOOP-1824:
-------------------------------

There are 2 problems with this patch.

1. It does not split the zip files efficiently. This is because there is no way in Java to construct a zip input stream that permits random seeks given a zip entry name.
2. Java's handling of large zip files is not robust.
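
To illustrate the first point, here is a minimal sketch (not taken from the attached patch) of what reaching a named entry looks like when only a stream is available. It uses java.util.zip.ZipInputStream, which hands out entries strictly in order through getNextEntry(). The class name SequentialZipScan and both helper methods are made up for this illustration, and the archive is built in memory just so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Hadoop or the patch.
public class SequentialZipScan {

    /** Builds a small in-memory zip with entries file0.txt .. file(n-1).txt. */
    static byte[] buildZip(int n) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                zip.putNextEntry(new ZipEntry("file" + i + ".txt"));
                zip.write(("contents " + i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /**
     * Walks the stream from the beginning until the named entry is found.
     * Returns how many earlier entries had to be read past, or -1 if the
     * entry is not in the archive.
     */
    static int entriesSkippedBefore(InputStream in, String name) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            int skipped = 0;
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    return skipped;
                }
                skipped++;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = buildZip(5);
        // file3.txt can only be reached after reading past file0..file2.
        System.out.println(
            entriesSkippedBefore(new ByteArrayInputStream(archive), "file3.txt"));
    }
}
```

With a stream like this, a map task assigned to entries near the end of a large archive would still have to read through everything before them, which is why the splits are not efficient.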

The plan was to modify the code to make use of an external zip parsing library that is compatible with the Apache license. It was decided to use the zip/unzip (standard shell tools) code via JNI, but support for large zip files is still missing from unzip (Zip 3.0 is already out with large zip file support).

So at the moment, the plan is to wait for Unzip 6.0 to come out and then modify the code accordingly.

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small files into large, compressed archives.  But, for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more small original files per split.  The zip format, unlike tar, permits enumeration of files in the archive without scanning the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives into splits that contain one or more archived files.
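
The idea in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not a Hadoop InputFormat: the class ZipSplitSketch, its methods, and the fixed entriesPerSplit grouping are all hypothetical. It relies on java.util.zip.ZipFile, which lists entries from the archive's central directory but needs a local file, so it does not by itself address archives stored in HDFS.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Hadoop or the patch.
public class ZipSplitSketch {

    /** Writes a temporary zip with entries file0.txt .. file(n-1).txt. */
    static File writeSampleZip(int n) throws IOException {
        File f = File.createTempFile("sample", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(f))) {
            for (int i = 0; i < n; i++) {
                zip.putNextEntry(new ZipEntry("file" + i + ".txt"));
                zip.write(("contents " + i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return f;
    }

    /**
     * Groups entry names into "splits" of at most entriesPerSplit names each.
     * Only the entry listing is consulted; no entry data is decompressed.
     */
    static List<List<String>> planSplits(File archive, int entriesPerSplit)
            throws IOException {
        List<List<String>> splits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive)) {
            List<String> current = new ArrayList<>();
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no data to process
                }
                current.add(entry.getName());
                if (current.size() == entriesPerSplit) {
                    splits.add(current);
                    current = new ArrayList<>();
                }
            }
            if (!current.isEmpty()) {
                splits.add(current); // last, possibly smaller, split
            }
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        File archive = writeSampleZip(5);
        // Five entries grouped two per split gives three splits.
        System.out.println(planSplits(archive, 2));
    }
}
```

A real implementation would more likely group entries by size than by count, and each split's reader would then open only the entries assigned to it.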
