Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2009/03/31 23:16:50 UTC

[jira] Updated: (HADOOP-4652) RAgzip: multiple map tasks for a large gzipped file

     [ https://issues.apache.org/jira/browse/HADOOP-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4652:
----------------------------------

    Status: Patch Available  (was: Open)

Running patch through Hudson

> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>
>                 Key: HADOOP-4652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4652
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred, native
>    Affects Versions: 0.19.0, 0.18.3
>            Reporter: Daehyun Kim
>            Assignee: Daehyun Kim
>            Priority: Minor
>         Attachments: HADOOP-4652-v2.patch, HADOOP-4652-v3.patch, HADOOP-4652.path
>
>
> Currently, Hadoop processes a gzipped file with only one map task.
> We have written a patch, which we call RAgzip, that enables multiple map tasks for one large gzipped file.
> To run multiple map tasks over a gzipped file, change the InputFormat to RAGZIPInputFormat.
> The options accepted by RAGZIPInputFormat are described in its javadoc.
> RAgzip uses zlib's inflatePrime function, which supports random access on a gzipped file.
> Since inflatePrime is available from zlib 1.2.2.4 onward, RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
> RAgzip requires a preprocessing step that creates an access point (.ap) file, which acts as an index of the gzipped file's chunks.
> The access point (.ap) file is stored in the same directory as the gzipped file.
> For example, for "/user/hadoop/test.gz", the .ap file is created as "/user/hadoop/test.gz.ap".
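
The two concrete rules in the description (the .ap file name and the minimum zlib version) can be illustrated with a small Java sketch. This is not code from the attached patches: the class RagzipSketch and its methods are hypothetical names introduced here for illustration, and the version check assumes a purely numeric dotted version string such as "1.2.3".

```java
// Hypothetical helper for illustration only; not part of the HADOOP-4652 patch.
final class RagzipSketch {
    // Minimum zlib version named in the issue description.
    static final String MIN_ZLIB = "1.2.2.4";

    private RagzipSketch() {}

    /** Returns the access point path: the gzipped file's path with ".ap" appended. */
    static String apPathFor(String gzPath) {
        return gzPath + ".ap";
    }

    /**
     * Compares two dotted numeric versions component by component.
     * Missing components count as 0, so "1.2.3" is treated as "1.2.3.0".
     * Assumption: every component is a plain non-negative integer.
     */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True when the given zlib version is at least the stated minimum. */
    static boolean zlibSupportsInflatePrime(String zlibVersion) {
        return compareVersions(zlibVersion, MIN_ZLIB) >= 0;
    }
}
```

With this sketch, RagzipSketch.apPathFor("/user/hadoop/test.gz") yields "/user/hadoop/test.gz.ap", and the tested zlib 1.2.3 passes the version check because its third component (3) is greater than that of 1.2.2.4 (2).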

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.