Posted to common-dev@hadoop.apache.org by "Goel, Ankur" <An...@corp.aol.com> on 2007/12/26 13:06:11 UTC

HADOOP-1824 | Proposed implementation

 
Hi,
   I am developing an InputFormat for zip files, as required by
HADOOP-1824. I would like to propose a simple approach and invite
comments and suggestions from the community on the implementation.

Implementation Approach
-----------------------

1. Implement class ZipInputFormat to extend FileInputFormat.

2. Override the getSplits() method to read each file's
   InputStream and construct a ZipInputStream out of it.

3. Create FileSplits such that each split has the following
   properties:
      *  FileSplit.start = index of the first zip entry in the split.
      *  FileSplit.length = index of the last zip entry in the split.
      *  FileSplit.file = the zip file.
      *  Sum of the compressed sizes of its zip entries <= splitSize.

   For example, start = 3, length = 6 signifies that zip entries 3 to 6
   will be read from the zip file of this split.

4. Implement class ZipRecordReader to read each zip entry in its split
   using LineRecordReader.
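
To make steps 3 and 4 concrete, here is a rough sketch that uses only
java.util.zip and no Hadoop classes. The class ZipSplitSketch and its
methods groupEntries, readLines and demoZip are hypothetical names made
up for illustration; they are not part of Hadoop or of any patch on
HADOOP-1824. The sketch assumes 0-based entry indices, inclusive ranges
as in the example above, UTF-8 text, and that the compressed sizes are
already known. A plain BufferedReader stands in for LineRecordReader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not a Hadoop class.
public class ZipSplitSketch {

    /** Step 3: group entries into inclusive [first, last] index ranges. */
    public static List<int[]> groupEntries(long[] compressedSizes, long splitSize) {
        List<int[]> ranges = new ArrayList<>();
        int first = 0;
        long total = 0;
        for (int i = 0; i < compressedSizes.length; i++) {
            // Close the current range if adding entry i would exceed splitSize.
            // An entry larger than splitSize still gets a range of its own.
            if (i > first && total + compressedSizes[i] > splitSize) {
                ranges.add(new int[] {first, i - 1});
                first = i;
                total = 0;
            }
            total += compressedSizes[i];
        }
        if (compressedSizes.length > 0) {
            ranges.add(new int[] {first, compressedSizes.length - 1});
        }
        return ranges;
    }

    /** Step 4: read all lines of entries first..last (inclusive, 0-based). */
    public static List<String> readLines(InputStream in, int first, int last)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            int index = 0;
            ZipEntry entry;
            // getNextEntry() skips whatever is left of the previous entry.
            while (index <= last && (entry = zip.getNextEntry()) != null) {
                if (index >= first && !entry.isDirectory()) {
                    // Do not close this reader: that would close the zip stream.
                    // read() reports end-of-stream at the end of each entry.
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(zip, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
                index++;
            }
        }
        return lines;
    }

    /** Builds a small in-memory zip with three text entries for the demo. */
    public static byte[] demoZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            String[] contents = {"a1\na2\n", "b1\n", "c1\nc2\n"};
            for (int i = 0; i < contents.length; i++) {
                out.putNextEntry(new ZipEntry("part-" + i + ".txt"));
                out.write(contents[i].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Prints: entries 0 to 1, entries 2 to 2, entries 3 to 3, entries 4 to 5
        for (int[] r : groupEntries(new long[] {40, 30, 50, 120, 10, 10}, 100)) {
            System.out.println("entries " + r[0] + " to " + r[1]);
        }
        // Prints: [b1, c1, c2]
        System.out.println(readLines(new ByteArrayInputStream(demoZip()), 1, 2));
    }
}
```

One thing the sketch makes visible: a ZipInputStream can only be read
sequentially, so a reader whose split starts at entry N still has to
skip over entries 0 to N-1 first. Also, as far as I know, when a zip is
read as a stream, getCompressedSize() may return -1 until the entry has
been consumed, so getSplits() might need another way to obtain the sizes.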

I think I might need to deal with CompressionCodecFactory and other
compression-related classes. Exactly how is not very clear to me,
so any hints here would be useful.

Apart from the above please let me know if there is anything that I am 
missing.

Thanks
-Ankur