Posted to mapreduce-user@hadoop.apache.org by Suhail Rehman <su...@gmail.com> on 2010/01/20 13:37:56 UTC

Reducers are stuck fetching map data.

We are having trouble running Hadoop MapReduce jobs on our cluster.

VMs running on an IBM blade center with the following virtualized
configuration:

Master Node/NameNode: 1x
OS: Xen Red Hat Linux 5.2, CPU: 3 vCPUs, RAM: 1024 MB
Slaves/DataNodes: 3x
OS: Xen Red Hat Linux 5.2, CPU: 1 vCPU, RAM: 1024 MB

We are working with the standard Hadoop example code, on Hadoop 0.20.1
(stable) with the latest patches installed. All VMs have their firewalls
turned off and SELinux disabled.

For example, when we execute the "wordcount" program on a provisioned
cluster, the Map operations complete successfully, but the program gets
stuck trying to complete the Reduce operations.

On examining the logs, we find that the Reducers are waiting for the outputs
of Map operations on other nodes. Our understanding is that this
communication happens over HTTP, and all these provisioned VMs seem to have
trouble communicating on the ports that Hadoop uses for it.
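One way to narrow this down (a rough sketch; the slave hostnames below are placeholders, and 50060 is the default TaskTracker HTTP port in 0.20) is to probe each TaskTracker's HTTP port directly from the node where a reducer is stuck:

```shell
# Default TaskTracker HTTP port in Hadoop 0.20; adjust if overridden
# via mapred.task.tracker.http.address in mapred-site.xml.
TT_PORT=50060
tt_url() { echo "http://$1:$TT_PORT/"; }

# slave1..slave3 are placeholder hostnames for the three DataNode VMs.
for host in slave1 slave2 slave3; do
  # -m 5 makes curl give up after 5 seconds instead of hanging
  # the way the reducers do.
  if curl -s -o /dev/null -m 5 "$(tt_url "$host")"; then
    echo "$host: reachable"
  else
    echo "$host: UNREACHABLE"
  fi
done
```

If any slave shows UNREACHABLE here, that points at VM networking rather than at Jetty itself.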

Also, when we try to access the JobTracker web interface to view the
running jobs, the machine takes a very long time to respond to our queries.
Since both the Reducer communication and the JobTracker web interface work
over HTTP, we think the problem might be a networking issue or a problem
with the built-in HTTP service in Hadoop (Jetty).

Attached is a partial task log from one of the Reducers. The warning
"WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException:
Read timed out"
appears on all reducers, and eventually the job either fails to complete or
takes a very long time (about 15 hours to process an 11 GB text file).

This problem seems to be random: sometimes the program finishes successfully
in about 20 minutes, and at other times the same operation takes 15 hours.

Any help with this would be much appreciated.

Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

Re: Reducers are stuck fetching map data.

Posted by Suhail Rehman <su...@gmail.com>.
I don't think the map runtime is high; the map phase usually completes in a
few minutes. The problem comes up for almost every input size (512 KB to
1 MB to 11 GB) and application (wordcount, pi, a custom imaging
application). I'll check the bug fix anyway.

Suhail

On Wed, Jan 20, 2010 at 5:22 PM, Amareshwari Sri Ramadasu <
amarsri@yahoo-inc.com> wrote:

>  Read timeouts have been found to be costly during the shuffle if the map
> runtime is high.
> Please see HADOOP-3327 (http://issues.apache.org/jira/browse/HADOOP-3327)
> for shuffle improvements made specifically for read timeouts.
>
> Thanks
> Amareshwari
>
>
> On 1/20/10 6:07 PM, "Suhail Rehman" <su...@gmail.com> wrote:
>
> [original message quoted in full; trimmed]
>


-- 
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

Re: Reducers are stuck fetching map data.

Posted by Suhail Rehman <su...@gmail.com>.
Yes, that's the one. Adding it there would be immensely helpful for others.

Suhail

On Tue, Jan 26, 2010 at 9:52 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> You mean that documentation?
>
> http://hadoop.apache.org/common/docs/r0.20.1/quickstart.html#Required+Software
>
> J-D
>
> On Tue, Jan 26, 2010 at 1:34 AM, Suhail Rehman <su...@gmail.com>
> wrote:
> > We finally figured it out! The problem was with the JDK installation on
> our
> > VMs, it was configured to use IBM JDK, and the moment we switched to Sun,
> > everything now works flawlessly.
> >
> > You may want to include this information somewhere in the documentation
> that
> > you strongly recommend Sun JDK to be used with Hadoop.
> >
> > Suhail
> >
> > [earlier quoted messages trimmed]
> >
>



-- 
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

Re: Reducers are stuck fetching map data.

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You mean that documentation?
http://hadoop.apache.org/common/docs/r0.20.1/quickstart.html#Required+Software

J-D

On Tue, Jan 26, 2010 at 1:34 AM, Suhail Rehman <su...@gmail.com> wrote:
> We finally figured it out! The problem was with the JDK installation on our
> VMs, it was configured to use IBM JDK, and the moment we switched to Sun,
> everything now works flawlessly.
>
> You may want to include this information somewhere in the documentation that
> you strongly recommend Sun JDK to be used with Hadoop.
>
> Suhail
>
> [earlier quoted messages trimmed]

Re: Reducers are stuck fetching map data.

Posted by Suhail Rehman <su...@gmail.com>.
We finally figured it out! The problem was with the JDK installation on our
VMs: they were configured to use the IBM JDK, and the moment we switched to
the Sun JDK, everything worked flawlessly.

You may want to note somewhere in the documentation that you *strongly
recommend* the Sun JDK for use with Hadoop.
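For anyone who hits this later, a quick sanity check (a rough sketch; the JDK path below is only an example) is to classify each node's JVM from its `java -version` banner and make sure hadoop-env.sh agrees:

```shell
# Classify a JVM from the text of its `java -version` output,
# to confirm each node is running the Sun (HotSpot) JDK.
detect_jvm() {
  case "$1" in
    *HotSpot*) echo sun ;;      # Sun JDK identifies itself as "Java HotSpot(TM) ..."
    *IBM*|*J9*) echo ibm ;;     # IBM JDK banners mention "IBM" and/or the J9 VM
    *) echo unknown ;;
  esac
}

# On each node, something like:
#   detect_jvm "$(java -version 2>&1)"
# should print "sun". Then make sure conf/hadoop-env.sh points at the
# same JDK, e.g.:
#   export JAVA_HOME=/usr/java/jdk1.6.0_17   # example path
```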

Suhail

On Thu, Jan 21, 2010 at 1:13 PM, Suhail Rehman <su...@gmail.com>wrote:

>
> We have verified that it does NOT solve the problem at all.  This would
> lead us to believe that the timeout issue we are experiencing is not part of
> the shuffle phase. Any other ideas that might help us?
>
> The Tasktracker logs show that these reducers are stuck during the copy
> phase.
>
> Suhail
>
>
> On Wed, Jan 20, 2010 at 5:22 PM, Amareshwari Sri Ramadasu <
> amarsri@yahoo-inc.com> wrote:
>
>> [quoted messages trimmed]
>
>
> --
> Regards,
>
> Suhail Rehman
> MS by Research in Computer Science
> International Institute of Information Technology - Hyderabad
> rehman@research.iiit.ac.in
> ---------------------------------------------------------------------
> http://research.iiit.ac.in/~rehman <http://research.iiit.ac.in/%7Erehman>
>



-- 
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

Re: Reducers are stuck fetching map data.

Posted by Suhail Rehman <su...@gmail.com>.
We have verified that it does NOT solve the problem. This would lead us to
believe that the timeout issue we are experiencing is not caused by the
shuffle phase. Any other ideas that might help us?

The TaskTracker logs show that these reducers are stuck during the copy
phase.

Suhail

On Wed, Jan 20, 2010 at 5:22 PM, Amareshwari Sri Ramadasu <
amarsri@yahoo-inc.com> wrote:

>  Read timeouts have been found to be costly during the shuffle if the map
> runtime is high.
> Please see HADOOP-3327 (http://issues.apache.org/jira/browse/HADOOP-3327)
> for shuffle improvements made specifically for read timeouts.
>
> Thanks
> Amareshwari
>
>
> On 1/20/10 6:07 PM, "Suhail Rehman" <su...@gmail.com> wrote:
>
> [original message quoted in full; trimmed]
>


-- 
Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman

Re: Reducers are stuck fetching map data.

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
Read timeouts have been found to be costly during the shuffle if the map runtime is high.
Please see HADOOP-3327 (http://issues.apache.org/jira/browse/HADOOP-3327) for shuffle improvements made specifically for read timeouts.

Thanks
Amareshwari

On 1/20/10 6:07 PM, "Suhail Rehman" <su...@gmail.com> wrote:

[original message quoted in full; trimmed]


Re: Reducers are stuck fetching map data.

Posted by Rekha Joshi <re...@yahoo-inc.com>.
This could be a network issue; however, try setting the mapred.task.timeout (in ms) and mapred.child.ulimit (together with the child -Xmx) parameters to see if they help. Refer below for other memory parameters:

http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html
http://hadoop.apache.org/common/docs/r0.20.1/cluster_setup.html
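As a sketch, the settings above would go in mapred-site.xml on the cluster nodes (the values below are illustrative only, not recommendations):

```xml
<!-- mapred-site.xml sketch; values are examples, not recommendations -->
<property>
  <name>mapred.task.timeout</name>
  <!-- Milliseconds before a task that reports no progress is killed;
       the 0.20 default is 600000 (10 minutes). -->
  <value>1200000</value>
</property>
<property>
  <name>mapred.child.ulimit</name>
  <!-- Maximum virtual memory, in kilobytes, for child task processes. -->
  <value>1048576</value>
</property>
```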

Cheers,
/R

On 1/20/10 6:07 PM, "Suhail Rehman" <su...@gmail.com> wrote:

We are having trouble running Hadoop MapReduce jobs on our cluster.

VMs running on an IBM blade center with the following virtualized configuration:

Master Node/Namenode: 1x
OS:                 Xen RedHat Linux 5.2, CPU : 3 vCPU, RAM: 1024 MB
Slaves/DataNode: 3x
OS:                 Xen RedHat Linux 5.2 1 vCPU, 1024 MB RAM

We are working with standard Hadoop example code. We are using Hadoop 0.20.1, stable with the latest patches installed. All VMs have firewalls turned off as well as SELinux disabled.

For example, while we try to execute the "wordcount" program on a provisioned cluster, the Map operations complete successfully, the program is stuck trying to complete the reduce operations.

On examining the logs, we find that the Reducers are waiting for the outputs from Map operations on other nodes. Our understanding is that this communication happens over HTTP sockets and all these provisioned VMs have trouble communicating over the HTTP sockets on the ports that Hadoop uses.

Also, while trying to access the JobTracker web interface to view the running jobs, we see that the machine is taking too much time to respond to our queries. Since both of the Reducer communication and the JobTracker web interface works over HTTP, we think the problem might be a networking issue or a problem with the built-in HTTP service in Hadoop (Jetty).

Attached is a partial Task log from one of the Reducers,
"WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out"
appears on all reducers, and eventually the Job either fails to complete or takes a very long time (about 15 hours to process a 11 GB text file).

This problem seems to be random and at times the program runs sucessfully in about 20 mins, othertimes it completes the operation in 15 hours.

Any help with regards to this would be much appreciated.

Regards,

Suhail Rehman
MS by Research in Computer Science
International Institute of Information Technology - Hyderabad
rehman@research.iiit.ac.in
---------------------------------------------------------------------
http://research.iiit.ac.in/~rehman