Posted to common-user@hadoop.apache.org by Jason Venner <ja...@attributor.com> on 2007/12/25 22:52:35 UTC

question on Hadoop configuration for non cpu intensive jobs - 0.15.1

We have two flavors of jobs we run through Hadoop. The first flavor is a 
simple merge sort, where there is very little happening in the mapper or 
the reducer.
The second flavor is very compute intensive.

In the first type, each of our map tasks consumes its (default-sized) 64MB 
input split in a small number of seconds, so quite a bit of the elapsed 
time is spent in job setup and shutdown.

We have tried reducing the number of splits by increasing the block 
sizes to 10x and 5x 64MB, but then we constantly get out-of-memory 
errors and timeouts. At this point each JVM is getting 768MB and I can't 
readily allocate more without dipping into swap.
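
(For reference, a minimal sketch of the settings in question, written against
the 0.15-era JobConf API; the driver class name is made up, and the child-heap
property name should be checked against your release's hadoop-default.xml.)

    import org.apache.hadoop.mapred.JobConf;

    public class MergeSortDriver {                       // hypothetical driver class
      public static void main(String[] args) {
        JobConf conf = new JobConf(MergeSortDriver.class);
        // Bigger blocks for newly written input files mean fewer, larger splits.
        conf.set("dfs.block.size", String.valueOf(5L * 64 * 1024 * 1024));
        // Per-task child JVM heap; verify this property name for your release.
        conf.set("mapred.child.java.opts", "-Xmx768m");
        // ... set input/output formats and paths, then JobClient.runJob(conf);
      }
    }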

What suggestions do people have for this case?

07/12/25 11:49:59 INFO mapred.JobClient: Task Id : task_200712251146_0001_m_000002_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1763)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1663)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1709)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:174)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

07/12/25 11:51:35 INFO mapred.JobClient: Task Id : task_200712251146_0001_r_000038_0, Status : FAILED
java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:484)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
        at org.apache.hadoop.dfs.$Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:269)
        at org.apache.hadoop.dfs.DFSClient.createNamenode(DFSClient.java:147)
        at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:161)
        at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:65)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:159)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:118)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:90)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1759)


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Jason Venner <ja...@attributor.com>.
Yes, our sequence files are stored in HDFS.

Some of them are constructed via the FileUtil.copyMerge routine, and some 
are the output of a mapper or a reducer; all of them are in HDFS.
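
(For readers who haven't used it, a hedged example of the copyMerge call
mentioned above; the paths are made up, and note that copyMerge does a plain
byte-level concatenation of the files it finds under the source directory.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeParts {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate everything under /data/parts into the single file /data/merged.
        // 'false' keeps the source files; the last argument is an optional separator
        // string written after each file (none here).
        FileUtil.copyMerge(fs, new Path("/data/parts"),
                           fs, new Path("/data/merged"),
                           false, conf, null);
      }
    }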


Eric Baldeschwieler wrote:
> I created HADOOP-2497 to describe this bug.
>
> Was your sequence file stored on HDFS?  Because HDFS does provide 
> checksums.
>
> On Dec 28, 2007, at 7:20 AM, Jason Venner wrote:
>
>> Our OOM was being caused by a damaged sequence data file. We had 
>> assumed that the sequence files had checksums, which appears to be 
>> incorrect.
>> The deserializer was reading a bad length out of the file and trying 
>> to allocate 4gig of ram.
>

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
I created HADOOP-2497 to describe this bug.

Was your sequence file stored on HDFS?  Because HDFS does provide  
checksums.

On Dec 28, 2007, at 7:20 AM, Jason Venner wrote:

> Our OOM was being caused by a damaged sequence data file. We had  
> assumed that the sequence files had checksums, which appears to be 
> incorrect.
> The deserializer was reading a bad length out of the file and  
> trying to allocate 4gig of ram.


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Jason Venner <ja...@attributor.com>.
Our OOM was being caused by a damaged sequence data file. We had assumed 
that the sequence files had checksums, which appears to be incorrect.
The deserializer was reading a bad length out of the file and trying to 
allocate 4 GB of RAM.
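
(A hedged sketch, not the poster's code, of walking a SequenceFile record by
record to see roughly where a damaged file falls over; a corrupt record length
like the one described above would surface inside reader.next().)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class ScanSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long records = 0;
        try {
          while (reader.next(key, val)) {   // a bad length typically blows up here
            records++;
          }
        } finally {
          System.err.println("read " + records + " records before stopping");
          reader.close();
        }
      }
    }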

RE: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Runping Qi <ru...@yahoo-inc.com>.

I have encountered similar problems many times too, especially when the input
data is compressed.
I had to raise the heap size to around 700MB to avoid OOM problems in the
mappers.

Runping


> -----Original Message-----
> From: Devaraj Das [mailto:ddas@yahoo-inc.com]
> Sent: Friday, December 28, 2007 3:28 AM
> To: hadoop-user@lucene.apache.org
> Subject: RE: question on Hadoop configuration for non cpu intensive jobs - 0.15.1
> 
> I am also interested in the test demonstrating OOM for large split sizes (if
> this is true then it is indeed a bug). Sort & Spill-to-disk should happen as
> soon as io.sort.mb amount of key/value data is collected. I am assuming that
> you didn't change (increased) the value of io.sort.mb when you increased the
> split size..
> 
> Thanks,
> Devaraj
> 
> > -----Original Message-----
> > From: Ted Dunning [mailto:tdunning@veoh.com]
> > Sent: Wednesday, December 26, 2007 4:31 AM
> > To: hadoop-user@lucene.apache.org
> > Subject: Re: question on Hadoop configuration for non cpu
> > intensive jobs - 0.15.1
> >
> >
> >
> > This sounds like a bug.
> >
> > The memory requirements for hadoop itself shouldn't change
> > with the split size.  At the very least, it should adapt
> > correctly to whatever the memory limits are.
> >
> > Can you build a version of your program that works from
> > random data so that you can file a bug?  If you contact me
> > off-line, I can help build a random data generator that
> > matches your input reasonably well.
> >
> >
> > On 12/25/07 2:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
> >
> > > My mapper in this case is the identity mapper, and the reducer gets
> > > about 10 values per key and makes a collect decision based on the data
> > > in the values.
> > > The reducer is very close to a no-op, and uses very little additional
> > > memory than the values.
> > >
> > > I believe the problem is in the amount of buffering in the output files.
> > >
> > > The quandary we have is the jobs run very poorly with the standard
> > > input split size as the mean time to finishing a split is very small,
> > > vrs gigantic memory requirements for large split sizes.
> > >
> > > Time to play with parameters again ... since the answer doesn't appear
> > > to be in working memory for the list.
> > >
> > >
> > >
> > > Ted Dunning wrote:
> > >> What are your mappers doing that they run out of memory?  Or is it
> > >> your reducers?
> > >>
> > >> Often, you can write this sort of program so that you don't have
> > >> higher memory requirements for larger splits.
> > >>
> > >>
> > >> On 12/25/07 1:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
> > >>
> > >>
> > >>> We have tried reducing the number of splits by increasing the block
> > >>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
> > >>> errors and timeouts. At this point each jvm is getting 768M and I
> > >>> can't readily allocate more without dipping into swap.
> > >>>
> > >>
> > >>
> >
> >


RE: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Devaraj Das <dd...@yahoo-inc.com>.
I am also interested in the test demonstrating OOM for large split sizes (if
this is true, then it is indeed a bug). Sort & spill-to-disk should happen as
soon as io.sort.mb worth of key/value data is collected. I am assuming that
you didn't change (increase) the value of io.sort.mb when you increased the
split size.
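
(A small illustrative fragment of that point: io.sort.mb bounds the map-side
collect buffer independently of the split size, so it is the value to watch
when the splits grow. The class here is only a sketch.)

    import org.apache.hadoop.mapred.JobConf;

    public class ShowSortMb {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Default is 100 MB of collected key/value data before a spill to disk.
        System.out.println("io.sort.mb = " + conf.getInt("io.sort.mb", 100));
        // If io.sort.mb is raised, the child heap has to grow with it, e.g.:
        // conf.setInt("io.sort.mb", 200);
        // conf.set("mapred.child.java.opts", "-Xmx768m");
      }
    }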

Thanks,
Devaraj

> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com] 
> Sent: Wednesday, December 26, 2007 4:31 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: question on Hadoop configuration for non cpu 
> intensive jobs - 0.15.1
> 
> 
> 
> This sounds like a bug.
> 
> The memory requirements for hadoop itself shouldn't change 
> with the split size.  At the very least, it should adapt 
> correctly to whatever the memory limits are.
> 
> Can you build a version of your program that works from 
> random data so that you can file a bug?  If you contact me 
> off-line, I can help build a random data generator that 
> matches your input reasonably well.
> 
> 
> On 12/25/07 2:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
> 
> > My mapper in this case is the identity mapper, and the reducer gets
> > about 10 values per key and makes a collect decision based on the data
> > in the values.
> > The reducer is very close to a no-op, and uses very little additional
> > memory than the values.
> > 
> > I believe the problem is in the amount of buffering in the output files.
> > 
> > The quandary we have is the jobs run very poorly with the standard
> > input split size as the mean time to finishing a split is very small,
> > vrs gigantic memory requirements for large split sizes.
> > 
> > Time to play with parameters again ... since the answer doesn't appear
> > to be in working memory for the list.
> > 
> > 
> > 
> > Ted Dunning wrote:
> >> What are your mappers doing that they run out of memory?  Or is it 
> >> your reducers?
> >> 
> >> Often, you can write this sort of program so that you don't have 
> >> higher memory requirements for larger splits.
> >> 
> >> 
> >> On 12/25/07 1:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
> >> 
> >>   
> >>> We have tried reducing the number of splits by increasing the block
> >>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
> >>> errors and timeouts. At this point each jvm is getting 768M and I
> >>> can't readily allocate more without dipping into swap.
> >>>     
> >> 
> >>   
> 
> 


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Ted Dunning <td...@veoh.com>.

This sounds like a bug.

The memory requirements for hadoop itself shouldn't change with the split
size.  At the very least, it should adapt correctly to whatever the memory
limits are.

Can you build a version of your program that works from random data so that
you can file a bug?  If you contact me off-line, I can help build a random
data generator that matches your input reasonably well.
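
(Along those lines, a hedged sketch of the sort of generator being offered: it
writes synthetic records into a SequenceFile so the failure can be reproduced
without the real input. Record count, value size and types are made-up
placeholders.)

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RandomSeqFileWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[0]), Text.class, BytesWritable.class);
        Random rand = new Random();
        byte[] buf = new byte[1024];                // ~1 KB values; adjust to match real data
        try {
          for (long i = 0; i < 1000000L; i++) {     // ~1 GB of synthetic records
            rand.nextBytes(buf);
            writer.append(new Text(Long.toString(i)),   // synthetic key
                          new BytesWritable(buf));      // random payload
          }
        } finally {
          writer.close();
        }
      }
    }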


On 12/25/07 2:52 PM, "Jason Venner" <ja...@attributor.com> wrote:

> My mapper in this case is the identity mapper, and the reducer gets
> about 10 values per key and makes a collect decision based on the data
> in the values.
> The reducer is very close to a no-op, and uses very little additional
> memory than the values.
> 
> I believe the problem is in the amount of buffering in the output files.
> 
> The quandary we have is the jobs run very poorly with the standard input
> split size as the mean time to finishing a split is very small, vrs
> gigantic memory requirements for large split sizes.
> 
> Time to play with parameters again ... since the answer doesn't appear
> to be in working memory for the list.
> 
> 
> 
> Ted Dunning wrote:
>> What are your mappers doing that they run out of memory?  Or is it your
>> reducers?
>> 
>> Often, you can write this sort of program so that you don't have higher
>> memory requirements for larger splits.
>> 
>> 
>> On 12/25/07 1:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
>> 
>>   
>>> We have tried reducing the number of splits by increasing the block
>>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
>>> errors and timeouts. At this point each jvm is getting 768M and I can't
>>> readily allocate more without dipping into swap.
>>>     
>> 
>>   


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Jason Venner <ja...@attributor.com>.
My mapper in this case is the identity mapper, and the reducer gets 
about 10 values per key and makes a collect decision based on the data 
in the values.
The reducer is very close to a no-op, and uses very little memory beyond 
the values.

I believe the problem is in the amount of buffering in the output files.

The quandary we have is that the jobs run very poorly with the standard input 
split size, since the mean time to finish a split is very small, vs. the 
gigantic memory requirements for large split sizes.

Time to play with parameters again ... since the answer doesn't appear 
to be in working memory for the list.
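
(A hedged illustration of the shape of reducer described here: it streams the
handful of values per key and collects at most one of them, so its memory use
does not grow with the split. The class and key/value types are invented, and
it is written against the genericized form of the mapred API, which post-dates
0.15, for readability.)

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PickOneReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        Text best = null;
        while (values.hasNext()) {
          Text candidate = values.next();
          // Keep only the value we may collect; no buffering of all values per key.
          if (best == null || candidate.compareTo(best) > 0) {
            best = new Text(candidate);   // copy, since Hadoop reuses the value object
          }
        }
        if (best != null) {
          out.collect(key, best);
        }
      }
    }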



Ted Dunning wrote:
> What are your mappers doing that they run out of memory?  Or is it your
> reducers?
>
> Often, you can write this sort of program so that you don't have higher
> memory requirements for larger splits.
>
>
> On 12/25/07 1:52 PM, "Jason Venner" <ja...@attributor.com> wrote:
>
>   
>> We have tried reducing the number of splits by increasing the block
>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
>> errors and timeouts. At this point each jvm is getting 768M and I can't
>> readily allocate more without dipping into swap.
>>     
>
>   

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Posted by Ted Dunning <td...@veoh.com>.
What are your mappers doing that they run out of memory?  Or is it your
reducers?

Often, you can write this sort of program so that you don't have higher
memory requirements for larger splits.


On 12/25/07 1:52 PM, "Jason Venner" <ja...@attributor.com> wrote:

> We have tried reducing the number of splits by increasing the block
> sizes to 10x and 5x 64meg, but then we constantly have out of memory
> errors and timeouts. At this point each jvm is getting 768M and I can't
> readily allocate more without dipping into swap.