
Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2007/12/04 01:46:43 UTC

[jira] Commented: (HADOOP-1823) want InputFormat for bzip2 files

    [ https://issues.apache.org/jira/browse/HADOOP-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548075 ] 

Doug Cutting commented on HADOOP-1823:
--------------------------------------

Why did you need to modify Ant's bzip2 code?  Could it not be used as is?

I'd hate to have to copy this code into Hadoop.  We could create our own jar of it, extracted from Ant's jar, or perhaps this would be an appropriate place to use Subversion's "externals" feature, linking to a tagged version of the sources in Ant's tree.

I note there's also a Commons project which has copied this code from Ant, but it does not yet have any releases.  I guess we could include its nightly jar, since it is code that's already been released by Ant...

> want InputFormat for bzip2 files
> --------------------------------
>
>                 Key: HADOOP-1823
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1823
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: bzip2.jar
>
>
> Unlike gzip, the bzip2 file format supports splitting.  Compression is by blocks (900k by default) and blocks are separated by a synchronization marker (a 48-bit approximation of Pi).  This would permit very large compressed files to be split across multiple map tasks, which is not currently possible unless using a Hadoop-specific file format.
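
The splitting idea in the description can be sketched as a scan for the block marker. This is a rough illustration in Python using only the standard library, not the InputFormat proposed in this issue. The names find_block_offsets and BLOCK_MAGIC are hypothetical. The sketch assumes the marker value 0x314159265359 (the digits of Pi in BCD) and reports bit offsets, since bzip2 blocks are not aligned to byte boundaries.

```python
import bz2

# 48-bit marker that starts every bzip2 block: the BCD digits of pi.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_offsets(data):
    """Return the bit offsets at which the block marker begins."""
    offsets = []
    window = 0      # the most recent 48 bits read
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


if __name__ == "__main__":
    # 256,000 bytes at compresslevel=1 (100k blocks) spans several blocks.
    sample = bz2.compress(bytes(range(256)) * 1000, compresslevel=1)
    print(find_block_offsets(sample))
```

Each reported offset is a candidate split point.  The 48-bit pattern could in principle also occur by chance inside compressed data, so a real reader would have to verify a candidate (for example by decoding the block and checking its CRC) before trusting it.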
