Posted to user@flume.apache.org by Jagadish Bihani <ja...@pubmatic.com> on 2012/10/26 13:00:57 UTC

Flume bz2 issue while processing by a map reduce job

Hi

I have a very peculiar scenario.

1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
decompress and read it, and it contains 0.2 million records.
2. Now I feed that file to a map-reduce job (Hadoop 1.0.3) and,
surprisingly, it reads only the first 100 records.
3. I then decompress the same file on the local file system, recompress
it with the Linux bzip2 command, and copy it back to HDFS.
4. Now I run the map-reduce job again and this time it correctly
processes all the records.

I think the Flume agent writes compressed data to the HDFS file in
batches, and somehow the bzip2 codec used by Hadoop reads only the
first part of the file.
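
A minimal way to test this hypothesis (an illustrative sketch using
Hadoop's codec API, not Flume's actual code): write two separately
finished bzip2 members into one file and read it back through a single
codec input stream. On Hadoop 1.0.3 I would expect only the first
member to come back:

    import java.io.*;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.util.ReflectionUtils;

    public class ConcatBz2Demo {

      // Compress one "batch" into its own self-contained bzip2 member.
      static byte[] compressMember(CompressionCodec codec, String text)
          throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(buf);
        out.write(text.getBytes("UTF-8"));
        out.close();   // closing finishes the compressed member
        return buf.toByteArray();
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        CompressionCodec codec =
            ReflectionUtils.newInstance(BZip2Codec.class, conf);

        // Two members appended back to back, like two flushed batches.
        File f = new File("concat.bz2");
        FileOutputStream raw = new FileOutputStream(f);
        raw.write(compressMember(codec, "record-1\n"));
        raw.write(compressMember(codec, "record-2\n"));
        raw.close();

        // Read the whole file back through one codec input stream.
        BufferedReader r = new BufferedReader(new InputStreamReader(
            codec.createInputStream(new FileInputStream(f)), "UTF-8"));
        for (String line; (line = r.readLine()) != null; ) {
          System.out.println(line);  // expected on Hadoop 1.x: only record-1
        }
        r.close();
      }
    }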

This means bz2 files generated by Flume can't be processed directly by
a map-reduce job. Is there any solution to this?

Any inputs about other compression formats?

P.S.
Versions:

Flume 1.2.0 (Raw version; downloaded from 
http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
Hadoop 1.0.3

Regards,
Jagadish

Re: Flume bz2 issue while processing by a map reduce job

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Hi Mike

Thanks for the valuable inputs. This was driving us crazy.
One thing to note, though: I had tested that this issue does not occur
with the lzo/lzop compression format (tested on Hadoop 1.0.3).

Regards,
Jagadish





Re: Flume bz2 issue while processing by a map reduce job

Posted by Mike Percy <mp...@apache.org>.
Hi Jagadish,
My understanding, based on investigating this issue over the last couple
of days, is that MapReduce jobs will only read the first section of a
concatenated bzip2 file. I believe you are correct that
https://issues.apache.org/jira/browse/HADOOP-6852 is the only way to
solve this issue, and that fix would only be for the Hadoop 2.0 line. I
think the Hadoop 1.x line would also need to backport other patches from
the 0.22 line, including
https://issues.apache.org/jira/browse/HADOOP-6835 (my understanding is
that that patch is already included in the 2.x line).

I am aware of folks interested in trying to fix HADOOP-6852, but I have
no ETA to give.

From Flume's perspective, I know of no other way of ensuring durability
using the hadoop-common APIs except to call finish() on the compression
stream at each transaction/batch boundary, flushing the compression
buffer so that hflush()/hsync() can be called with the fully written
data. This results in concatenated compressed plain text files in the
case of CompressedStream.
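
To make that concrete, here is a rough sketch of that write pattern
(not Flume's actual code; the class and method names are made up for
illustration):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;

    public class BatchCompressedWriter {
      // Illustrative only: make one batch durable by finishing the current
      // compressed member before syncing, then reset so the next batch
      // starts a new member. Each flushed batch therefore appends another
      // compressed "section" to the same HDFS file.
      static void flushBatch(CompressionOutputStream cmpOut,
                             FSDataOutputStream fsOut,
                             List<byte[]> batch) throws IOException {
        for (byte[] event : batch) {
          cmpOut.write(event);
        }
        cmpOut.finish();      // close out the current compressed member
        fsOut.flush();
        fsOut.sync();         // Hadoop 1.x durability call (hflush()/hsync() in 2.x)
        cmpOut.resetState();  // subsequent writes begin a fresh member
      }
    }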

Current workarounds include not using compression, reprocessing the
compressed file as you mention, using a SequenceFile as a container, or
using an Avro file as a container. The latter two are splittable and
properly handle several compression codecs, including Snappy, which is a
great way to go if you can do it.
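
For example, a sketch of an HDFS sink configured to write
Snappy-compressed SequenceFiles (the "agent1"/"sink1" names are
placeholders, and this assumes the Snappy codec is installed on your
cluster; check the Flume HDFS sink documentation for the exact
properties your version supports):

    # Illustrative Flume properties, not a drop-in config
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
    agent1.sinks.sink1.hdfs.fileType = SequenceFile
    agent1.sinks.sink1.hdfs.writeFormat = Writable
    agent1.sinks.sink1.hdfs.codeC = snappy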

Regards,
Mike


Re: Flume bz2 issue while processing by a map reduce job

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Hi

Any inputs on this?
It looks like a basic thing which, I would guess, must already have
been handled in Flume.




Re: Flume bz2 issue while processing by a map reduce job

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Text.

A few updates on that:
-- It looks like some kind of header issue.
-- When I copyToLocal the file and then copy it back to HDFS, the
map-reduce job then processes the file correctly.
Is it something related to
https://issues.apache.org/jira/browse/HADOOP-6852?

Regards,
Jagadish


On 10/30/2012 09:15 PM, Brock Noland wrote:
> What kind of files is your sink writing out? Text, Sequence, etc?


Re: Flume bz2 issue while processing by a map reduce job

Posted by Brock Noland <br...@cloudera.com>.
What kind of files is your sink writing out? Text, Sequence, etc?

On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
<ja...@pubmatic.com> wrote:
>
> Same thing happens even for gzip.



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: Flume compression peculiar behaviour while processing compressed files by a map reduce job

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Does anyone have any inputs on why the behaviour described in my
earlier mails might be happening?



Re: Flume bz2 issue while processing by a map reduce job

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Same thing happens even for gzip.

Regards,
Jagadish
