Posted to common-user@hadoop.apache.org by Ryan LeCompte <le...@gmail.com> on 2008/09/21 04:07:34 UTC

Reduce tasks running out of memory on small hadoop cluster

Hello all,

I'm setting up a small 3-node Hadoop cluster (1 node for the
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out-of-memory error. My guess is that the amount of data I'm
crunching through just won't fit in memory during the reduce tasks on
two machines (max of 2 reduce tasks on each machine). Is this
expected? If I had a larger Hadoop cluster, I could increase the total
number of reduce tasks across the cluster so that all of the data
isn't being processed in just 4 JVMs on two machines, as I currently
have it set up, correct? Is there any way to get the reduce tasks to
not try to hold all of the data in memory, or is my only option to
add more nodes to the cluster and thereby increase the number of
reduce tasks?

Thanks!

Ryan

Re: Reduce tasks running out of memory on small hadoop cluster

Posted by Karl Anderson <ka...@somethingsimpler.com>.
On 20-Sep-08, at 7:07 PM, Ryan LeCompte wrote:

> Hello all,
>
> I'm setting up a small 3 node hadoop cluster (1 node for
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out of memory error. My guess is because the amount of data
> that I'm crunching through just won't be able to fit in memory during
> the reduce tasks on two machines (max of 2 reduce tasks on each
> machine). Is this expected? If I had a large hadoop cluster, then I
> could increase the number of reduce tasks on each machine of the
> cluster so that not all of the data to be processed is occurring in
> just 4 JVMs on two machines like I currently have setup, correct? Is
> there any way to get the reduce task to not try and hold all of the
> data in memory, or is my only option to add more nodes to the cluster
> to therefore increase the number of reduce tasks?

You can set the number of reduce tasks with a configuration option.   
More tasks means less input per task; since the number of concurrent  
tasks doesn't change, this should help you.  I'd like to be able to  
set the number of concurrent tasks, myself, but haven't noticed a way.
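
For the reduce count, a minimal illustration of the relevant property in
hadoop-site.xml (the value of 8 is just an example, not a recommendation;
you can also set it per job with JobConf.setNumReduceTasks()):

<property>
  <name>mapred.reduce.tasks</name>
  <!-- example value only; tune to your data size and cluster -->
  <value>8</value>
</property>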

In the end, I had to practice better design to reduce my memory  
footprint; sometimes one quick-and-dirty way to do this is to turn one  
job into a chain of jobs that each do less.
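
As a rough sketch of the chaining idea (this assumes the old
org.apache.hadoop.mapred API; the class name and paths are made up for
illustration, and the mapper/reducer setup is elided):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        // First pass does part of the work and writes an intermediate directory.
        JobConf first = new JobConf(ChainedDriver.class);
        first.setJobName("pass-1");
        FileInputFormat.setInputPaths(first, new Path("/data/raw"));            // made-up path
        FileOutputFormat.setOutputPath(first, new Path("/data/intermediate"));  // made-up path
        // ... set mapper/reducer classes and output types for the first pass ...
        JobClient.runJob(first);   // blocks until the first job completes

        // Second pass reads the intermediate output, so each job holds less in memory.
        JobConf second = new JobConf(ChainedDriver.class);
        second.setJobName("pass-2");
        FileInputFormat.setInputPaths(second, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/data/final"));        // made-up path
        // ... set mapper/reducer classes and output types for the second pass ...
        JobClient.runJob(second);
    }
}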


Karl Anderson
kra@monkey.org
http://monkey.org/~kra




Re: Reduce tasks running out of memory on small hadoop cluster

Posted by Ryan LeCompte <le...@gmail.com>.
I actually solved the problem by increasing the child task JVM heap in
hadoop-site.xml, since the default wasn't sufficient:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
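
If you only need the bigger heap for a particular job, I believe the same
property can also be set per job from the driver instead of cluster-wide,
along these lines (MyJob is just a placeholder class name):

JobConf conf = new JobConf(MyJob.class);
// same property as in hadoop-site.xml above, applied only to this job's tasks
conf.set("mapred.child.java.opts", "-Xmx1024m");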

Thanks,
Ryan


On Sun, Sep 21, 2008 at 12:59 AM, Ryan LeCompte <le...@gmail.com> wrote:
> Yes I did, but that didn't solve my problem since I'm working with a fairly
> large data set (8gb).
>
> Thanks,
> Ryan
>
>
>
>
> On Sep 21, 2008, at 12:22 AM, Sandy <sn...@gmail.com> wrote:
>
>> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
>> me some, but eventually I had to upgrade to a system with more memory.
>>
>> -SM
>>
>>
>> On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <le...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I'm setting up a small 3 node hadoop cluster (1 node for
>>> namenode/jobtracker and the other two for datanode/tasktracker). The
>>> map tasks finish fine, but the reduce tasks are failing at about 30%
>>> with an out of memory error. My guess is because the amount of data
>>> that I'm crunching through just won't be able to fit in memory during
>>> the reduce tasks on two machines (max of 2 reduce tasks on each
>>> machine). Is this expected? If I had a large hadoop cluster, then I
>>> could increase the number of reduce tasks on each machine of the
>>> cluster so that not all of the data to be processed is occurring in
>>> just 4 JVMs on two machines like I currently have setup, correct? Is
>>> there any way to get the reduce task to not try and hold all of the
>>> data in memory, or is my only option to add more nodes to the cluster
>>> to therefore increase the number of reduce tasks?
>>>
>>> Thanks!
>>>
>>> Ryan
>>>
>

Re: Reduce tasks running out of memory on small hadoop cluster

Posted by Ryan LeCompte <le...@gmail.com>.
Yes I did, but that didn't solve my problem, since I'm working with a
fairly large data set (8 GB).

Thanks,
Ryan




On Sep 21, 2008, at 12:22 AM, Sandy <sn...@gmail.com> wrote:

> Have you increased the heapsize in conf/hadoop-env.sh to 2000? This  
> helped
> me some, but eventually I had to upgrade to a system with more memory.
>
> -SM
>
>
> On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <le...@gmail.com>  
> wrote:
>
>> Hello all,
>>
>> I'm setting up a small 3 node hadoop cluster (1 node for
>> namenode/jobtracker and the other two for datanode/tasktracker). The
>> map tasks finish fine, but the reduce tasks are failing at about 30%
>> with an out of memory error. My guess is because the amount of data
>> that I'm crunching through just won't be able to fit in memory during
>> the reduce tasks on two machines (max of 2 reduce tasks on each
>> machine). Is this expected? If I had a large hadoop cluster, then I
>> could increase the number of reduce tasks on each machine of the
>> cluster so that not all of the data to be processed is occurring in
>> just 4 JVMs on two machines like I currently have setup, correct? Is
>> there any way to get the reduce task to not try and hold all of the
>> data in memory, or is my only option to add more nodes to the cluster
>> to therefore increase the number of reduce tasks?
>>
>> Thanks!
>>
>> Ryan
>>

Re: Reduce tasks running out of memory on small hadoop cluster

Posted by Sandy <sn...@gmail.com>.
Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
me some, but eventually I had to upgrade to a system with more memory.
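
For reference, the line I'm referring to is the HADOOP_HEAPSIZE setting in
conf/hadoop-env.sh (the value is in MB):

export HADOOP_HEAPSIZE=2000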

-SM


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte <le...@gmail.com> wrote:

> Hello all,
>
> I'm setting up a small 3 node hadoop cluster (1 node for
> namenode/jobtracker and the other two for datanode/tasktracker). The
> map tasks finish fine, but the reduce tasks are failing at about 30%
> with an out of memory error. My guess is because the amount of data
> that I'm crunching through just won't be able to fit in memory during
> the reduce tasks on two machines (max of 2 reduce tasks on each
> machine). Is this expected? If I had a large hadoop cluster, then I
> could increase the number of reduce tasks on each machine of the
> cluster so that not all of the data to be processed is occurring in
> just 4 JVMs on two machines like I currently have setup, correct? Is
> there any way to get the reduce task to not try and hold all of the
> data in memory, or is my only option to add more nodes to the cluster
> to therefore increase the number of reduce tasks?
>
> Thanks!
>
> Ryan
>