Posted to common-dev@hadoop.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2007/12/27 09:24:43 UTC

[jira] Updated: (HADOOP-1824) want InputFormat for zip files

     [ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated HADOOP-1824:
--------------------------


Proposed Implementation Approach
--------------------------------

1. Implement class ZipInputFormat to extend FileInputFormat.

2. Override the getSplits() method to read each file's
   InputStream and construct a ZipInputStream out of it.

3. Create FileSplits such that each split has the following
   properties:
      *  FileSplit.start = index of the first zip entry in the split.
      *  FileSplit.length = index of the last zip entry in the split.
      *  FileSplit.file = the zip file.
      *  Sum of the compressed sizes of the split's zip entries <= splitSize

   For example, start = 3, length = 6 signifies that zip entries 3 to 6
   will be read from the zip file of this split.

4. Implement class ZipRecordReader to read each zip entry in its split
   using LineRecordReader.

5. Each zip entry will be treated as a text file.

6. Implement the necessary unit test case classes.
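
The grouping rule in step 3 could be sketched as follows. This is only an
illustration under assumptions, not the proposed patch: ZipSplitPlanner,
EntryRange, and plan() are hypothetical names, the compressed sizes are
assumed to be known already (for example from the zip central directory),
and an entry larger than splitSize is assumed to get a split of its own,
since a single entry is not subdivided here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups consecutive zip entries into index ranges
// whose summed compressed size stays within splitSize where possible.
final class ZipSplitPlanner {

    // Inclusive range of zip entry indices assigned to one split.
    static final class EntryRange {
        final int first;
        final int last;

        EntryRange(int first, int last) {
            this.first = first;
            this.last = last;
        }
    }

    static List<EntryRange> plan(long[] compressedSizes, long splitSize) {
        List<EntryRange> ranges = new ArrayList<>();
        int first = 0;   // index of the first entry in the current range
        long total = 0;  // compressed bytes accumulated in the current range

        for (int i = 0; i < compressedSizes.length; i++) {
            long size = compressedSizes[i];
            // Close the current range if adding this entry would exceed the
            // budget. A range always keeps at least one entry, so an
            // oversized entry ends up alone in its own range.
            if (i > first && total + size > splitSize) {
                ranges.add(new EntryRange(first, i - 1));
                first = i;
                total = 0;
            }
            total += size;
        }
        if (compressedSizes.length > 0) {
            ranges.add(new EntryRange(first, compressedSizes.length - 1));
        }
        return ranges;
    }
}
```

For compressed sizes {40, 30, 50, 120, 10, 10} and splitSize = 100, this
yields the entry ranges 0-1, 2, 3, and 4-5.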

Questions:
==========
1. Is there a need to implement a ZipCodec (like GzipCodec and DefaultCodec)?
2. Should the ZipRecordReader be flexible enough to treat the individual zip
   entries in a FileSplit as either text files or sequence files?
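
For the text-file case in question 2, a reader over a range of entries
might look like the sketch below. It uses plain java.util.zip and a
BufferedReader rather than Hadoop's LineRecordReader, so it only
illustrates the idea. ZipEntryLineReader and readLines() are hypothetical
names, entries are assumed to be UTF-8 text, and entry indices are assumed
to count every entry in archive order, starting at 0.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: collects the text lines of zip entries first..last
// (inclusive), treating each entry as a text file.
final class ZipEntryLineReader {

    static List<String> readLines(InputStream in, int first, int last)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            int index = 0;
            // Stop advancing once the last entry of the range is consumed.
            while (index <= last && (entry = zip.getNextEntry()) != null) {
                if (index >= first && !entry.isDirectory()) {
                    // ZipInputStream reports end-of-stream at the end of the
                    // current entry, so a fresh reader per entry cannot read
                    // into the next one. The reader is deliberately not
                    // closed, because that would close the zip stream too.
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(zip, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
                index++;
            }
        }
        return lines;
    }
}
```

Entries before the range are still skipped sequentially here, because
ZipInputStream has no random access; a real record reader would probably
want to seek to the first entry of its split instead.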

Please feel free to comment on anything I missed that might be required.
Any suggestions or recommendations to improve the implementation will also
be greatly appreciated.

-Ankur

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small files into large, compressed archives.  But, for efficient map-reduce operation, it is desirable to be able to split inputs into smaller chunks, with one or more small original files per split.  The zip format, unlike tar, permits enumeration of files in the archive without scanning the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives into splits that contain one or more archived files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.