Posted to common-user@hadoop.apache.org by Steve Kuo <ku...@gmail.com> on 2010/02/14 21:11:19 UTC

Problem with large .lzo files

I am running a hadoop job that combines daily results with results from
previous days.  The reduce output is lzo compressed and growing daily in
size.


   - DistributedLzoIndexer is used to index the lzo files to provide
   parallelism (a rough sketch of how this is wired up is included below).
   When the lzo files were small, everything went well.  As the .lzo files
   grow, the chance that one of the partitions does not complete increases.
   The exception I got for one such case is listed at the end of the post.
   - It's possible to sacrifice parallelism by having hadoop work on each
   .lzo file without indexing.  This worked well until the file size exceeded
   30G, at which point an array indexing exception was thrown.  Apparently the
   code processes the file in chunks and stores references to the chunks in an
   array.  When the number of chunks exceeded a certain limit (around 256, as
   I recall), the exception was thrown.
   - My current workaround is to increase the number of reducers to keep
   the .lzo file size low.
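
For reference, the indexing and job wiring looks roughly like the sketch
below.  This is a simplified, untested sketch rather than the actual driver:
the class name and paths are placeholders, and the hadoop-lzo classes
(DistributedLzoIndexer, LzoTextInputFormat) are from the version we run.

    // Simplified sketch of the setup described above; class name and paths
    // are placeholders, not the real job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.ToolRunner;
    import com.hadoop.compression.lzo.DistributedLzoIndexer;
    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class DailyCombineDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Step 1: write .index files next to the .lzo outputs so they
        // become splittable.
        ToolRunner.run(conf, new DistributedLzoIndexer(),
            new String[] { "/user/hadoop/foo/baz_old" });

        // Step 2: the combining job reads the indexed .lzo files through the
        // LZO-aware input format, so each indexed chunk becomes a map split.
        Job job = new Job(conf, "daily-combine");
        job.setJarByClass(DailyCombineDriver.class);
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/foo/baz_old"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/foo/baz_new"));
        // mapper, reducer, and output compression settings omitted here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }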

I would like advice on how people handle large .lzo files.  Any
pointers on the cause of the stack trace below and the best way to resolve it
are greatly appreciated.


Task Logs: 'attempt_201002041203_0032_m_000134_3'

syslog logs
2010-02-13 19:59:51,839 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-13 19:59:52,045 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-13 19:59:52,099 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-13 19:59:52,111 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: InputFile=hdfs://ndbig03:9000/user/hadoop/foo/basr_old/part-r-00000.lzo
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: maxBarCount =1000
2010-02-13 19:59:55,267 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18641936; bufvoid = 99614720
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-13 19:59:55,958 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-13 19:59:56,730 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18641936; bufend = 37532762; bufvoid = 99614720
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-13 19:59:59,487 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37532762; bufend = 56161807; bufvoid = 99614720
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-13 20:00:02,282 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: bufstart = 56161807; bufend = 74935116; bufvoid = 99614720
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
2010-02-13 20:00:05,100 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-02-13 20:00:06,177 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.InternalError: lzo1x_decompress_safe returned:
    at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
    at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
    at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:104)
    at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at java.io.InputStream.read(InputStream.java:85)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
    at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Re: Problem with large .lzo files

Posted by Steve Kuo <ku...@gmail.com>.
Splitting the problem file into two smaller ones allowed the process to
succeed.  This, I believe, points to a bug in the LzoDecompressor.

Re: Problem with large .lzo files

Posted by Steve Kuo <ku...@gmail.com>.
Here is the ArrayIndexOutOfBoundsException from LzoDecompressor.  I have 30
lzo files of roughly 800MB each.  16 of them were processed by the mapper
successfully, with around 60 splits each.  File part-r-00012.lzo failed at
split 45, so it is clear the failure was not size related.  I have copied out
part-r-00012.lzo, split it in half, and lzo'ed the smaller files.  I am now
running the same set of input, but with the smaller files instead, to see if
the process completes.

Below is the abridged stack trace.

2010-02-21 11:15:34,718 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-21 11:15:34,909 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-21 11:15:34,972 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-21 11:15:34,977 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: InputFile=hdfs://foo-ndbig03.lax1.foo.com:9000/user/hadoop/baz/faz_old/part-r-00012.lzo
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: maxFazCount =1000
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18821149; bufvoid = 99614720
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-21 11:15:39,125 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-21 11:15:39,738 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18821149; bufend = 37722479; bufvoid = 99614720
2010-02-21 11:15:41,918 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-21 11:15:42,642 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-21 11:15:44,564 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37722479; bufend = 56518058; bufvoid = 99614720
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-21 11:15:45,485 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2

...
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32585253; bufend = 51221187; bufvoid = 99614720
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65501; kvend = 327645; length = 327680
2010-02-21 11:17:34,056 INFO org.apache.hadoop.mapred.MapTask: Finished spill 44
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51221187; bufend = 70104763; bufvoid = 99614720
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327645; kvend = 262108; length = 327680
2010-02-21 11:17:36,731 INFO org.apache.hadoop.mapred.MapTask: Finished spill 45
2010-02-21 11:17:37,802 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.ArrayIndexOutOfBoundsException
    at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:200)
    at com.hadoop.compression.lzo.LzopDecompressor.setInput(LzopDecompressor.java:98)
    at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:297)
    at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:232)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at java.io.InputStream.read(InputStream.java:85)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
    at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-02-21 11:17:37,809 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

Re: Problem with large .lzo files

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Feb 15, 2010 at 2:48 PM, Steve Kuo <ku...@gmail.com> wrote:
>> You should be able to downcast the InputSplit to FileSplit, if you're
>> using the new API. From there you can get the start and length of the
>> split.
>
> Cool, let me give it a shot.
>
>> Interesting. If you can somehow make a reproducible test case I'd be
>> happy to look into this.
>
> This sounds great.  As the input file is 1G, let me do some work on my side
> to see if I can pinpoint it so as not to have to transfer a 1G file around.
>

Yea, if you could generate a 1G file programmatically that experiences
the problem, that would work too.
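
For example, something along the lines of the sketch below could churn out a
large .lzo file programmatically.  This is a rough, untested sketch only: it
assumes the hadoop-lzo native library is available locally, the output name
is a placeholder, and there is no guarantee random lines will trigger the
same bug.

    // Rough sketch: write pseudo-random tab-separated lines through the
    // hadoop-lzo codec until the compressed file reaches roughly 1G.
    // Needs the native-lzo library loaded locally.
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import com.hadoop.compression.lzo.LzopCodec;

    public class MakeBigLzo {
      public static void main(String[] args) throws Exception {
        LzopCodec codec = new LzopCodec();
        codec.setConf(new Configuration());
        File out = new File("big-test.lzo");  // placeholder file name
        OutputStream os = codec.createOutputStream(new FileOutputStream(out));
        Random rand = new Random(42);
        StringBuilder line = new StringBuilder();
        while (out.length() < (1L << 30)) {  // stop once ~1G is on disk
          line.setLength(0);
          line.append("key").append(rand.nextInt(1 << 20));
          for (int i = 0; i < 20; i++) {
            line.append('\t').append(rand.nextLong());
          }
          line.append('\n');
          os.write(line.toString().getBytes("UTF-8"));
        }
        os.close();
      }
    }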

-Todd

Re: Problem with large .lzo files

Posted by Steve Kuo <ku...@gmail.com>.
> You should be able to downcast the InputSplit to FileSplit, if you're
> using the new API. From there you can get the start and length of the
> split.

Cool, let me give it a shot.

> Interesting. If you can somehow make a reproducible test case I'd be
> happy to look into this.

This sounds great.  As the input file is 1G, let me do some work on my side
to see if I can pinpoint it so as not to have to transfer a 1G file around.

Thanks.

Re: Problem with large .lzo files

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo <ku...@gmail.com> wrote:
> On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
> By the way, if all files have been indexed, DistributedLzoIndexer does not
> detect that and hadoop throws an exception complaining that the input dir
> (or file) does not exist.  I work around this by catching the exception.
>

Just fixed that in my github repo. Thanks for the bug report.

>
>> >   - It's possible to sacrifice parallelism by having hadoop work on each
>> >   .lzo file without indexing.  This worked well until the file size
>> exceeded
>> >   30G when array indexing exception got thrown.  Apparently the code
>> processed
>> >   the file in chunks and stored the references to the chunk in an array.
>>  When
>> >   the number of chunks was greater than a certain number (around 256 was
>> my
>> >   recollection), exception was thrown.
>> >   - My current work around is to increase the number of reducers to keep
>> >   the .lzo file size low.
>> >
>> > I would like to get advices on how people handle large .lzo files.  Any
>> > pointers on the cause of the stack trace below and best way to resolve it
>> > are greatly appreciated.
>> >
>>
>> Is this reproducible every time? If so, is it always at the same point
>> in the LZO file that it occurs?
>
> It's at the same point.  Do you know how to print out the lzo index for the
> task?  I only print out the input file now.
>

You should be able to downcast the InputSplit to FileSplit, if you're
using the new API. From there you can get the start and length of the
split.
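
Something like the sketch below (untested, new API; the mapper name is just a
placeholder) would print each task's byte range into the task syslog, next to
the InputFile line you already log:

    // Untested sketch (new API): log the FileSplit range in setup().
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
          FileSplit fileSplit = (FileSplit) split;
          // Shows up in the task syslog next to the InputFile line.
          System.err.println("Split: file=" + fileSplit.getPath()
              + " start=" + fileSplit.getStart()
              + " length=" + fileSplit.getLength());
        }
      }
      // map() unchanged; omitted here.
    }

The start/length of the failing attempt, together with the .index file,
should narrow down which compressed block is the bad one.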

>
>> Would it be possible to download that lzo file to your local box and
>> use lzop -d to see if it decompresses successfully? That way we can
>> isolate whether it's a compression bug or decompression.
>
> Both the java LzoDecompressor and lzop -d were able to decompress the file
> correctly.  As a matter of fact, my job does not index the .lzo files now but
> processes each one as a whole, and it works.
>

Interesting. If you can somehow make a reproducible test case I'd be
happy to look into this.

Thanks
-Todd

Re: Problem with large .lzo files

Posted by Steve Kuo <ku...@gmail.com>.
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Steve,
>
> I'm not sure here whether you mean that the DistributedLzoIndexer job
> is failing, or if the job on the resulting split file is failing. Could
> you clarify?
>
>
The DistributedLzoIndexer job did complete successfully.  It was the job on
the resulting split file that always failed on one particular split, while
the other splits succeeded.

By the way, if all the files have already been indexed, DistributedLzoIndexer
does not detect that, and hadoop throws an exception complaining that the
input dir (or file) does not exist.  I work around this by catching the
exception, as in the sketch below.
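
Roughly like this (a sketch only; I catch broadly rather than a specific
exception type, and the wrapper class name is just illustrative):

    // Rough sketch of the workaround: run the indexer and swallow the failure
    // that occurs when every .lzo file already has an .index file, so the
    // indexer's input list comes up empty.  Deliberately broad catch;
    // illustrative only, not the exact code.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import com.hadoop.compression.lzo.DistributedLzoIndexer;

    public class IndexIfNeeded {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try {
          ToolRunner.run(conf, new DistributedLzoIndexer(), args);
        } catch (Exception e) {
          // Nothing left to index (or a real failure): log it and move on.
          System.err.println("Indexer skipped: " + e.getMessage());
        }
      }
    }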


> >   - It's possible to sacrifice parallelism by having hadoop work on each
> >   .lzo file without indexing.  This worked well until the file size
> exceeded
> >   30G when array indexing exception got thrown.  Apparently the code
> processed
> >   the file in chunks and stored the references to the chunk in an array.
>  When
> >   the number of chunks was greater than a certain number (around 256 was
> my
> >   recollection), exception was thrown.
> >   - My current work around is to increase the number of reducers to keep
> >   the .lzo file size low.
> >
> > I would like to get advices on how people handle large .lzo files.  Any
> > pointers on the cause of the stack trace below and best way to resolve it
> > are greatly appreciated.
> >
>
> Is this reproducible every time? If so, is it always at the same point
> in the LZO file that it occurs?

It's at the same point.  Do you know how to print out the lzo index for the
task?  I only print out the input file now.


> Would it be possible to download that lzo file to your local box and
> use lzop -d to see if it decompresses successfully? That way we can
> isolate whether it's a compression bug or decompression.

Both the java LzoDecompressor and lzop -d were able to decompress the file
correctly.  As a matter of fact, my job does not index the .lzo files now but
processes each one as a whole, and it works.

Re: Problem with large .lzo files

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Steve,

On Sun, Feb 14, 2010 at 12:11 PM, Steve Kuo <ku...@gmail.com> wrote:
> I am running a hadoop job that combines daily results with results with
> previous days.  The reduce output is lzo compressed and growing daily in
> size.
>
>
>   - DistributedLzoIndexer is used to index lzo files to provide
>   parallelism.  When the size of the lzo files were small, everything went
>   well.  As the size of .lzo files grow, the chance that one of the partition
>   does not complete increases.  The exception I got for one such case is
>   listed at the end of the post.

I'm not sure here whether you mean that the DistributedLzoIndexer job
is failing, or if the job on the resulting split file is failing. Could
you clarify?

>   - It's possible to sacrifice parallelism by having hadoop work on each
>   .lzo file without indexing.  This worked well until the file size exceeded
>   30G when array indexing exception got thrown.  Apparently the code processed
>   the file in chunks and stored the references to the chunk in an array.  When
>   the number of chunks was greater than a certain number (around 256 was my
>   recollection), exception was thrown.
>   - My current work around is to increase the number of reducers to keep
>   the .lzo file size low.
>
> I would like to get advices on how people handle large .lzo files.  Any
> pointers on the cause of the stack trace below and best way to resolve it
> are greatly appreciated.
>

Is this reproducible every time? If so, is it always at the same point
in the LZO file that it occurs?

Would it be possible to download that lzo file to your local box and
use lzop -d to see if it decompresses successfully? That way we can
isolate whether it's a compression bug or decompression.
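
If lzop -d comes back clean, it's also worth reading the file back through
the Java codec by itself, outside MapReduce.  Roughly like the sketch below
(untested; it assumes the hadoop-lzo native library is available on the local
machine, and the path is a placeholder):

    // Untested sketch: stream a local .lzo file through the hadoop-lzo Java
    // codec and count the decompressed bytes, to see whether the Java
    // decompressor also fails outside of MapReduce.
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import com.hadoop.compression.lzo.LzopCodec;

    public class LzoReadCheck {
      public static void main(String[] args) throws Exception {
        LzopCodec codec = new LzopCodec();
        codec.setConf(new Configuration());
        InputStream in = codec.createInputStream(
            new FileInputStream("part-r-00012.lzo"));  // placeholder path
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        for (int n = in.read(buf); n > 0; n = in.read(buf)) {
          total += n;
        }
        in.close();
        System.out.println("Decompressed " + total + " bytes");
      }
    }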

Thanks
-Todd