Posted to mapreduce-issues@hadoop.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2015/03/11 13:14:40 UTC

[jira] [Commented] (MAPREDUCE-210) want InputFormat for zip files

    [ https://issues.apache.org/jira/browse/MAPREDUCE-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356797#comment-14356797 ] 

Hari Sekhon commented on MAPREDUCE-210:
---------------------------------------

There is a third-party zip InputFormat here:

http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/

I think it's important for the zip InputFormat to be natively supported. Traditional enterprises, where Hadoop is now starting to penetrate, use zip a lot. This is especially true of large, Windows-heavy corporations, which don't realize the problems they cause by keeping so much data in zip files that Hadoop currently can't read.

> want InputFormat for zip files
> ------------------------------
>
>                 Key: MAPREDUCE-210
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-210
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Doug Cutting
>            Assignee: indrajit
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small files into large, compressed archives.  But, for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more small original files per split.  The zip format, unlike tar, permits enumeration of files in the archive without scanning the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives into splits that contain one or more archived files.
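
To illustrate the enumeration idea in the description above, here is a minimal sketch using only the JDK's java.util.zip, not the Hadoop InputFormat API and not the attached patch. The class name ZipSplitSketch, its methods, and the size-based grouping policy are all hypothetical. It also assumes the archive is a local file, since java.util.zip.ZipFile needs random access; a real InputFormat would have to do the equivalent against HDFS and record byte offsets for each split.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: plan "splits" as groups of entry names.
class ZipSplitSketch {

    // Groups entries so each split holds roughly targetBytes of compressed data.
    // ZipFile lists entries from the archive's central directory, so no entry
    // data is decompressed or scanned here.
    static List<List<String>> planSplits(Path archive, long targetBytes) throws IOException {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // getCompressedSize() returns -1 when unknown; treat that as 0.
                long size = Math.max(entry.getCompressedSize(), 0);
                // Start a new split once the current one would exceed the target.
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    // Writes a small temporary archive of text files, for demonstration only.
    static Path writeDemoArchive(int fileCount) throws IOException {
        Path archive = Files.createTempFile("demo", ".zip");
        try (OutputStream out = Files.newOutputStream(archive);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i < fileCount; i++) {
                zip.putNextEntry(new ZipEntry("file-" + i + ".txt"));
                zip.write(("contents of file " + i).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return archive;
    }

    public static void main(String[] args) throws IOException {
        Path archive = writeDemoArchive(5);
        for (List<String> split : planSplits(archive, 40)) {
            System.out.println(split);
        }
        Files.delete(archive);
    }
}
```

Each inner list stands for one split that a single map task could process. The 40-byte target in main is arbitrary; in practice it would presumably be tied to the HDFS block size.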



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)