Posted to common-user@hadoop.apache.org by David Sinclair <ds...@chariotsolutions.com> on 2011/01/21 22:04:17 UTC
Losing Records with Block Compressed Sequence File
Hi, I am seeing an odd problem when writing block compressed sequence files.
If I write 400,000 records into a sequence file w/o compression, all 400K
end up in the file. If I write with block compression, regardless of whether
the codec is bz2 or deflate, I start losing records. Not a ton, but a couple
hundred.
Here are the exact numbers:

  bz2      399,734 records
  deflate  399,770 records
  none     400,000 records

Conf settings:
  io.file.buffer.size: 4K
  io.seqfile.compress.blocksize: 1MB
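For reference, here is a stripped-down version of the write path (the real
code lives inside Flume; the key/value types, path, and codec below are
placeholders, not what I actually run):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class BlockCompressedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("io.file.buffer.size", 4096);               // 4K
            conf.setInt("io.seqfile.compress.blocksize", 1048576);  // 1MB
            FileSystem fs = FileSystem.get(conf);
            // BLOCK compression buffers records and compresses a block at a time
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/tmp/records.seq"),
                    LongWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new DefaultCodec());
            try {
                for (long i = 0; i < 400000; i++) {
                    writer.append(new LongWritable(i), new Text("record-" + i));
                }
            } finally {
                writer.close();
            }
        }
    }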
Anyone ever seen this behavior?
thanks
dave
Re: Losing Records with Block Compressed Sequence File
Posted by David Sinclair <ds...@chariotsolutions.com>.
That is what I was thinking. I am using Flume for log
collection/aggregation. I'll have a look at the code to see what is going
on, thanks.
On Fri, Jan 21, 2011 at 6:43 PM, Alan Malloy <al...@yieldbuild.com> wrote:
> Make sure to close the output writer? I had similar problems in a different
> scenario and it turned out I was neglecting to close/flush my output.
>
> [quoted original message snipped]
Re: Losing Records with Block Compressed Sequence File
Posted by Alan Malloy <al...@yieldbuild.com>.
Make sure to close the output writer? I had similar problems in a
different scenario and it turned out I was neglecting to close/flush my
output.
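Something like this is what I mean. With CompressionType.BLOCK, appended
records sit in an in-memory buffer until io.seqfile.compress.blocksize is
reached, so a process that goes away without calling close() drops whatever
is still buffered, which would look exactly like losing a couple hundred
records. If the writer has to stay open (a long-running collector, say), my
understanding is that sync() also spills the current block, but verify that
against your Hadoop version. A rough sketch, not tested against your setup:

    import java.io.IOException;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;

    public final class SafeAppend {
        private SafeAppend() {}

        // Appends a batch, then forces the buffered block out.
        // close() does this too; sync() is for writers that must stay open.
        public static void appendBatch(SequenceFile.Writer writer,
                                       Writable[] keys, Writable[] vals)
                throws IOException {
            for (int i = 0; i < keys.length; i++) {
                writer.append(keys[i], vals[i]);
            }
            writer.sync(); // assumption: the BLOCK writer spills its buffer here
        }
    }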
On 01/21/2011 01:04 PM, David Sinclair wrote:
> [quoted original message snipped]
Re: Losing Records with Block Compressed Sequence File
Posted by Greg Roelofs <ro...@yahoo-inc.com>.
> A few days ago I tried my unit test against bzip2 and found a similar
> effect: records go missing at the seams between the splits.
> Perhaps my unit test is buggy, or perhaps you and I have independently
> found something that should be reported as a bug.
Probably. I found a different bug (apparently?) in the bzip2 decoder
a while back: HADOOP-6852. If you're not concatenating bzip2 streams,
you're seeing something else.
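For anyone unfamiliar with that failure mode: a "concatenated" bzip2 file is
two or more complete streams written back to back, and a decoder that stops
at the first end-of-stream marker silently drops everything after it. A
quick standalone illustration, using Commons Compress as a stand-in rather
than the Hadoop-internal decoder that HADOOP-6852 patched:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class ConcatenatedBzip2 {
        // Compress a string into one complete bzip2 stream.
        static byte[] bzip2(String s) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(bos);
            out.write(s.getBytes("UTF-8"));
            out.close();
            return bos.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            byte[] a = bzip2("first stream\n");
            byte[] b = bzip2("second stream\n");
            byte[] joined = new byte[a.length + b.length];
            System.arraycopy(a, 0, joined, 0, a.length);
            System.arraycopy(b, 0, joined, a.length, b.length);

            // 'true' tells the decoder to keep reading across stream
            // boundaries; without it, only "first stream" comes back.
            BZip2CompressorInputStream in = new BZip2CompressorInputStream(
                    new ByteArrayInputStream(joined), true);
            int c;
            while ((c = in.read()) != -1) {
                System.out.print((char) c);
            }
            in.close();
        }
    }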
Greg
Re: Losing Records with Block Compressed Sequence File
Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,
2011/1/21 David Sinclair <ds...@chariotsolutions.com>:
> Hi, I am seeing an odd problem when writing block compressed sequence files.
> If I write 400,000 records into a sequence file w/o compression, all 400K
> end up in the file. If I write with block compression, regardless of whether
> the codec is bz2 or deflate, I start losing records. Not a ton, but a couple
> hundred.
How big is the output file?
How many splits are created?
> [exact numbers and conf settings snipped]
> Anyone ever seen this behavior?
I've been working on HADOOP-7076, which makes Gzip splittable (the
feature is almost done).
For this I created a JUnit test that really hammers the splitting and
checks that all "seams" are accurate (no missing records and no
duplicate records).
A few days ago I tried my unit test against bzip2 and found a similar
effect: records go missing at the seams between the splits.
Perhaps my unit test is buggy, or perhaps you and I have independently
found something that should be reported as a bug.
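Independent of the splitting harness, a quick way to check the writer side
is simply to read the file back and count records. A bare-bones sketch (not
my actual test; the path is assumed to match the writer):

    import static org.junit.Assert.assertEquals;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    public class RecordCountTest {
        @Test
        public void allRecordsSurviveBlockCompression() throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(
                    fs, new Path("/tmp/records.seq"), conf);
            LongWritable key = new LongWritable();
            Text value = new Text();
            long count = 0;
            while (reader.next(key, value)) {
                count++;
            }
            reader.close();
            assertEquals(400000L, count);
        }
    }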
--
Best regards,
Niels Basjes