Posted to common-issues@hadoop.apache.org by "Abdul Qadeer (JIRA)" <ji...@apache.org> on 2009/09/01 13:20:32 UTC

[jira] Commented: (HADOOP-4012) Providing splitting support for bzip2 compressed files

    [ https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749860#action_12749860 ] 

Abdul Qadeer commented on HADOOP-4012:
--------------------------------------

{quote}
The API issue with InputStreamCreationResultant is clearer to me, now (I hadn't seen why a reader might need a modified start). Other than synchronizing on the codec before creating new streams (to avoid the race condition), I don't see a better way to do this without pushing other API changes. Unless someone has a better idea, I think documenting this requirement on SplittableCompressionCodec is sufficient for now (and making these methods synchronized in BZip2Codec).
{quote}

{quote}
From https://issues.apache.org/jira/browse/MAPREDUCE-830
Though it's not changed in bzip, since getEnd is part of the API, it should be called in LineRecordReader.
Since the codec has state, the API demands that LineRecordReader synchronize on the codec before creating a splittable stream and calling getStart and getEnd to avoid race conditions (unless a better solution is found in HADOOP-4012)
{quote}


Is the code in LineRecordReader executed by multiple threads?  I found some discussion about it here http://issues.apache.org/jira/browse/HADOOP-3554 but that discussion apparently did not reach a conclusion.

Even if it is run by multiple threads, I am not able to see the race condition, because start / end are changed only once, in the constructor (I am assuming that the LineRecordReader constructor is not called by multiple threads simultaneously).

The only problem I see is that LineRecordReader should not forget to call the getStart() method after calling the createInputStream() method.  Put another way, anyone using SplittableCompressionCodec must call the getStart() / getEnd() methods after creating the stream.
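To make the ordering requirement concrete, here is a minimal, self-contained sketch.  All class and method names below (SplitStream, SplittableCodec, openSplit, the 100-byte "block" size) are hypothetical stand-ins, not the actual Hadoop API; the point is only the contract under discussion: create the stream first, then read the (possibly adjusted) start / end from it, synchronizing on the codec because the codec carries state between the two calls.

```java
// Hypothetical, simplified sketch of the SplittableCompressionCodec usage
// contract: createInputStream() first, then getStart() / getEnd(), all
// under synchronization on the (stateful) codec.
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class SplitUsageSketch {

    // Stand-in for the splittable stream: it may move the requested split
    // boundaries to the nearest compressed-block boundaries.
    static class SplitStream {
        private final long adjustedStart, adjustedEnd;
        SplitStream(long start, long end) {
            // Hypothetical adjustment: snap to 100-byte "block" boundaries.
            this.adjustedStart = (start / 100) * 100;
            this.adjustedEnd = ((end + 99) / 100) * 100;
        }
        long getStart() { return adjustedStart; }
        long getEnd()   { return adjustedEnd; }
    }

    // Stand-in for the codec; it holds state across calls, hence the
    // synchronization requirement on the caller.
    static class SplittableCodec {
        synchronized SplitStream createInputStream(InputStream in,
                                                   long start, long end) {
            return new SplitStream(start, end);
        }
    }

    // What a LineRecordReader-style caller must do: create the stream,
    // THEN take start / end from it rather than from the raw split.
    static long[] openSplit(SplittableCodec codec, InputStream in,
                            long rawStart, long rawEnd) {
        synchronized (codec) {                       // guard codec state
            SplitStream s = codec.createInputStream(in, rawStart, rawEnd);
            return new long[] { s.getStart(), s.getEnd() };
        }
    }

    public static void main(String[] args) {
        SplittableCodec codec = new SplittableCodec();
        InputStream in = new ByteArrayInputStream(new byte[0]);
        long[] bounds = openSplit(codec, in, 150, 420);
        System.out.println(bounds[0] + " " + bounds[1]); // 100 500
    }
}
```

If getStart() were read before createInputStream() (or from the raw split), the reader would use the unadjusted offsets and could miss or double-read a compressed block.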

So I am a little confused about the race condition comments.  Can you please help me understand?

Thanks.

> Providing splitting support for bzip2 compressed files
> ------------------------------------------------------
>
>                 Key: HADOOP-4012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4012
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: Hadoop-4012-version1.patch, Hadoop-4012-version10.patch, Hadoop-4012-version11.patch, Hadoop-4012-version2.patch, Hadoop-4012-version3.patch, Hadoop-4012-version4.patch, Hadoop-4012-version5.patch, Hadoop-4012-version6.patch, Hadoop-4012-version7.patch, Hadoop-4012-version8.patch, Hadoop-4012-version9.patch
>
>
> Hadoop assumes that if the input data is compressed, it cannot be split (mainly due to the limitation of many codecs, which need the whole input stream to decompress successfully).  So in such a case, Hadoop prepares only one split per compressed file, where the lower split limit is 0 and the upper limit is the end of the file.  The consequence of this decision is that one compressed file goes to a single mapper.  Although this circumvents the limitation of the codecs (as mentioned above), it substantially reduces the parallelism that would otherwise be possible with splitting.
> BZip2 is a compression / decompression algorithm that compresses blocks of data, and these compressed blocks can later be decompressed independently of each other.  This is an opportunity: instead of one BZip2-compressed file going to one mapper, we can process chunks of the file in parallel.  The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block should be processed by exactly one mapper, and ultimately all the blocks of the file should be processed.  (By processing we mean the actual use of the uncompressed data coming out of the codec in a mapper.)
> We are writing the code to implement this suggested functionality.  Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support.  The details of these changes will be posted when we submit the code.
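The correctness criterion in the description above (each compressed block processed by exactly one mapper, all blocks processed) amounts to a simple ownership rule: a block belongs to the split whose byte range contains the block's starting offset.  The sketch below is hypothetical (the names, offsets, and block layout are illustrative, not from the actual patch):

```java
// Hypothetical sketch of the block-ownership rule: a compressed block is
// processed by exactly one mapper, namely the one whose split range
// [splitStart, splitEnd) contains the block's starting byte offset.
import java.util.ArrayList;
import java.util.List;

public class BlockOwnership {

    // Given the offsets where compressed blocks begin, return the offsets
    // of the blocks this split is responsible for decompressing.
    static List<Long> blocksForSplit(long[] blockOffsets,
                                     long splitStart, long splitEnd) {
        List<Long> owned = new ArrayList<>();
        for (long off : blockOffsets) {
            if (off >= splitStart && off < splitEnd) {
                owned.add(off);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        long[] blocks = {0, 120, 250, 400, 530};
        // Two splits covering the file; every block lands in exactly one.
        System.out.println(blocksForSplit(blocks, 0, 300));   // [0, 120, 250]
        System.out.println(blocksForSplit(blocks, 300, 600)); // [400, 530]
    }
}
```

Because split ranges partition the file's byte range, the half-open test assigns every block to exactly one split, even when a block straddles a split boundary.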

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.