Posted to common-user@hadoop.apache.org by David Sinclair <ds...@chariotsolutions.com> on 2011/01/21 22:04:17 UTC

Losing Records with Block Compressed Sequence File

Hi, I am seeing an odd problem when writing block-compressed sequence files.
If I write 400,000 records into a sequence file without compression, all 400K
end up in the file. If I write with block compression, whether it is bz2 or
deflate, I start losing records. Not a ton, but a couple hundred.

Here are the exact numbers

bz2      399,734
deflate  399,770
none     400,000

Conf settings
io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
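
For reference, the writer is set up roughly like this (a simplified sketch,
not the real code; the key/value types, output path, and record contents here
are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class BlockCompressedWrite {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setInt("io.file.buffer.size", 4096);                  // 4K
        conf.setInt("io.seqfile.compress.blocksize", 1024 * 1024); // 1MB
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/records.seq");                   // placeholder path

        // Block-compressed writer; swap DefaultCodec for BZip2Codec to test bz2.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
          for (int i = 0; i < 400000; i++) {
            writer.append(new LongWritable(i), new Text("record-" + i));
          }
        } finally {
          writer.close();   // flushes the final partial block to the file
        }
      }
    }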

anyone ever see this behavior?

thanks

dave

Re: Losing Records with Block Compressed Sequence File

Posted by David Sinclair <ds...@chariotsolutions.com>.
That is what I was thinking. I am using Flume for log
collection/aggregation. I'll have a look at the code to see what is going
on, thanks.

On Fri, Jan 21, 2011 at 6:43 PM, Alan Malloy <al...@yieldbuild.com> wrote:

> Make sure to close the output writer? I had similar problems in a different
> scenario and it turned out I was neglecting to close/flush my output.
>
>
> On 01/21/2011 01:04 PM, David Sinclair wrote:
>
>> Hi, I am seeing an odd problem when writing block-compressed sequence
>> files.
>> If I write 400,000 records into a sequence file without compression, all 400K
>> end up in the file. If I write with block compression, whether it is bz2 or
>> deflate, I start losing records. Not a ton, but a couple hundred.
>>
>> Here are the exact numbers
>>
>> bz2      399,734
>> deflate  399,770
>> none     400,000
>>
>> Conf settings
>> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
>>
>> anyone ever see this behavior?
>>
>> thanks
>>
>> dave
>>
>>

Re: Losing Records with Block Compressed Sequence File

Posted by Alan Malloy <al...@yieldbuild.com>.
Make sure to close the output writer? I had similar problems in a 
different scenario and it turned out I was neglecting to close/flush my 
output.
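
In my case the fix was just making sure the writer is flushed and closed on
every code path, something along these lines (illustrative pattern only;
moreRecords(), nextKey() and nextValue() are made-up stand-ins for whatever
feeds your writer, and fs/conf/path are assumed to be set up already):

    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK, new DefaultCodec());
      while (moreRecords()) {
        // With BLOCK compression, appends are buffered in memory until the
        // block fills up; a partial block only reaches the file when sync()
        // or close() is called.
        writer.append(nextKey(), nextValue());
      }
    } finally {
      if (writer != null) {
        writer.close();   // writes out the last partial block, then closes
      }
    }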

On 01/21/2011 01:04 PM, David Sinclair wrote:
> Hi, I am seeing an odd problem when writing block-compressed sequence files.
> If I write 400,000 records into a sequence file without compression, all 400K
> end up in the file. If I write with block compression, whether it is bz2 or
> deflate, I start losing records. Not a ton, but a couple hundred.
>
> Here are the exact numbers
>
> bz2      399,734
> deflate  399,770
> none     400,000
>
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
>
> anyone ever see this behavior?
>
> thanks
>
> dave
>

Re: Losing Records with Block Compressed Sequence File

Posted by Greg Roelofs <ro...@yahoo-inc.com>.
> A few days ago I tried my unit test against bzip2 and found a similar
> effect: records go missing at the seams between the splits.

> Perhaps my unit test is buggy, perhaps you and I have independently
> found something that should be reported as a bug.

Probably.  I found a different bug (apparently?) in the bzip2 decoder
a while back:  HADOOP-6852.  If you're not concatenating bzip2 streams,
you're seeing something else.

Greg

Re: Losing Records with Block Compressed Sequence File

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

2011/1/21 David Sinclair <ds...@chariotsolutions.com>:
> Hi, I am seeing an odd problem when writing block-compressed sequence files.
> If I write 400,000 records into a sequence file without compression, all 400K
> end up in the file. If I write with block compression, whether it is bz2 or
> deflate, I start losing records. Not a ton, but a couple hundred.

How big is the output file?
How many splits are created?

> Here are the exact numbers
>
> bz2      399,734
> deflate  399,770
> none     400,000
>
> Conf settings
> io.file.buffer.size - 4K, io.seqfile.compress.blocksize - 1MB
>
> anyone ever see this behavior?

I've been working on HADOOP-7076, which makes Gzip splittable (the feature
is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all the "seams" are accurate (no missing records and no
duplicate records).
A few days ago I tried my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.
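
In outline the check looks something like this (a simplified sketch, not the
actual test from HADOOP-7076; it uses the old org.apache.hadoop.mapred API,
and the path, types and split count are just examples):

    JobConf job = new JobConf(conf);
    FileInputFormat.setInputPaths(job, new Path("/tmp/records.seq"));
    SequenceFileInputFormat<LongWritable, Text> format =
        new SequenceFileInputFormat<LongWritable, Text>();

    long total = 0;
    for (InputSplit split : format.getSplits(job, 10)) {
      RecordReader<LongWritable, Text> reader =
          format.getRecordReader(split, job, Reporter.NULL);
      LongWritable key = reader.createKey();
      Text value = reader.createValue();
      while (reader.next(key, value)) {
        total++;
      }
      reader.close();
    }
    // If every seam is handled correctly, 'total' must equal the number of
    // records that were written.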

Perhaps my unit test is buggy, perhaps you and I have independently
found something that should be reported as a bug.
-- 
Best regards,

Niels Basjes