Posted to common-user@hadoop.apache.org by "Natarajan, Senthil" <se...@pitt.edu> on 2008/03/27 15:10:42 UTC

Reduce Hangs

Hi,
I have a small Hadoop cluster: one master and three slaves.
When I try the example wordcount on one of our log files (size ~350 MB),
the map runs fine but the reduce always hangs (sometimes around 19%, 60%, ...); after a very long time it finishes.
I am seeing this error:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
In the log I am seeing this:
INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at 0.02 MB/s) >

Do you know what might be the problem?
Thanks,
Senthil
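
[For reference, the bundled wordcount example is typically launched as below; the jar name follows the 0.16-era layout and the HDFS input/output paths are placeholders, not the poster's actual paths.]

    # Run the stock wordcount example against a file already in HDFS
    # (in-dir and out-dir are hypothetical paths).
    bin/hadoop jar hadoop-*-examples.jar wordcount in-dir out-dir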


Re: Reduce Hangs

Posted by Mafish Liu <ma...@gmail.com>.
All ports are listed in conf/hadoop-default.xml and conf/hadoop-site.xml.
Also, if you are using HBase, you need to pay attention to hbase-default.xml and
hbase-site.xml, located in the HBase directory.
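
[As a sketch, pinning the shuffle- and HDFS-relevant ports in conf/hadoop-site.xml could look like the following; the property names follow 0.16-era defaults, so verify them against your own conf/hadoop-default.xml.]

    <!-- conf/hadoop-site.xml (assumes 0.16-era property names). -->
    <property>
      <name>mapred.task.tracker.http.address</name>
      <!-- TaskTracker HTTP server; reducers fetch map outputs from
           this port, so the firewall must allow it. -->
      <value>0.0.0.0:50060</value>
    </property>
    <property>
      <name>dfs.datanode.address</name>
      <!-- DataNode data-transfer port. -->
      <value>0.0.0.0:50010</value>
    </property>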

2008/3/29 Natarajan, Senthil <se...@pitt.edu>:

> Hi,
> Thanks for your suggestions.
>
> It looks like the problem is with the firewall. I created a firewall rule to
> allow ports 50000 to 50100 (I found that Hadoop was listening in this port
> range).
>
> It looks like I am missing some ports, and those get blocked by the firewall.
>
> Could anyone please let me know how to configure Hadoop to use only
> certain specified ports, so that those ports can be allowed in the firewall.
>
> Thanks,
> Senthil
>
>


-- 
Mafish@gmail.com
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

RE: Reduce Hangs

Posted by "Natarajan, Senthil" <se...@pitt.edu>.
Hi,
Thanks for your suggestions.

It looks like the problem is with the firewall. I created a firewall rule to allow ports 50000 to 50100 (I found that Hadoop was listening in this port range).

It looks like I am missing some ports, and those get blocked by the firewall.

Could anyone please let me know how to configure Hadoop to use only certain specified ports, so that those ports can be allowed in the firewall.

Thanks,
Senthil
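
[For what it's worth, a rule of roughly that shape on a Linux host using iptables would look like the line below; the chain and interface details are assumptions about the local setup.]

    # Allow inbound TCP on the range the Hadoop daemons were observed
    # listening on; adjust the chain/interface to the local firewall.
    iptables -A INPUT -p tcp --dport 50000:50100 -j ACCEPT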

-----Original Message-----
From: 朱盛凯 [mailto:geniusjash@gmail.com]
Sent: Thursday, March 27, 2008 12:32 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Hangs

Hi,

I ran into this problem in my cluster before, so I can share some of my
experience, though it may not apply in your case.

The jobs in my cluster always hung at 16% of the reduce. This occurred because
the reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this failure of communication between
two task trackers.

One was the firewall blocking the trackers from communicating; I solved this by
disabling the firewall.
The other was that the trackers referred to other nodes by host name only,
not by IP address; I solved this by editing /etc/hosts on every node
with mappings from hostname to IP address for all nodes in the cluster.
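
[A minimal sketch of such a mapping, placed on every node; the hostnames and addresses below are hypothetical examples.]

    # /etc/hosts -- map every cluster hostname to its IP so the
    # trackers can resolve one another (names and IPs are examples).
    192.168.1.10  master
    192.168.1.11  slave1
    192.168.1.12  slave2
    192.168.1.13  slave3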

I hope my experience will be helpful for you.

On 3/27/08, Natarajan, Senthil <se...@pitt.edu> wrote:
>
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> the map runs fine but the reduce always hangs (sometimes around 19%, 60%, ...);
> after a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at
> 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>

Re: Reduce Hangs

Posted by Mafish Liu <ma...@gmail.com>.
On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 <ge...@gmail.com> wrote:

> Hi,
>
> I ran into this problem in my cluster before, so I can share some of my
> experience, though it may not apply in your case.
>
> The jobs in my cluster always hung at 16% of the reduce. This occurred
> because the reduce task could not fetch the map output from other nodes.
>
> In my case, two factors could cause this failure of communication between
> two task trackers.
>
> One was the firewall blocking the trackers from communicating; I solved
> this by disabling the firewall.
> The other was that the trackers referred to other nodes by host name only,
> not by IP address; I solved this by editing /etc/hosts on every node
> with mappings from hostname to IP address for all nodes in the cluster.


I ran into this problem for the same reason too.
Try adding the host names of all nodes to the /etc/hosts file on every node.

>
>
> I hope my experience will be helpful for you.
>
> On 3/27/08, Natarajan, Senthil <se...@pitt.edu> wrote:
> >
> > Hi,
> > I have a small Hadoop cluster: one master and three slaves.
> > When I try the example wordcount on one of our log files (size ~350 MB),
> > the map runs fine but the reduce always hangs (sometimes around 19%, 60%,
> > ...); after a very long time it finishes.
> > I am seeing this error:
> > Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> > In the log I am seeing this:
> > INFO org.apache.hadoop.mapred.TaskTracker:
> > task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at
> > 0.02 MB/s) >
> >
> > Do you know what might be the problem?
> > Thanks,
> > Senthil
> >
> >
>



-- 
Mafish@gmail.com
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

Re: Reduce Hangs

Posted by 朱盛凯 <ge...@gmail.com>.
Hi,

I ran into this problem in my cluster before, so I can share some of my
experience, though it may not apply in your case.

The jobs in my cluster always hung at 16% of the reduce. This occurred because
the reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this failure of communication between
two task trackers.

One was the firewall blocking the trackers from communicating; I solved this by
disabling the firewall.
The other was that the trackers referred to other nodes by host name only,
not by IP address; I solved this by editing /etc/hosts on every node
with mappings from hostname to IP address for all nodes in the cluster.

I hope my experience will be helpful for you.

On 3/27/08, Natarajan, Senthil <se...@pitt.edu> wrote:
>
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> the map runs fine but the reduce always hangs (sometimes around 19%, 60%, ...);
> after a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at
> 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>

Re: Reduce Hangs

Posted by Amar Kamat <am...@yahoo-inc.com>.
On Thu, 27 Mar 2008, Natarajan, Senthil wrote:

> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> the map runs fine but the reduce always hangs (sometimes around 19%, 60%, ...); after a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
This error occurs when the reducer fails to fetch map-task output from 5
unique map tasks. Before considering an attempt failed, the reducer
tries to fetch the map output 7 times within 5 minutes (default config).
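
[Since reducers pull map outputs over each TaskTracker's embedded HTTP server (port 50060 by default), one quick check is to confirm that port is reachable from the node running the stuck reducer; the slave hostnames below are hypothetical.]

    # From the reducer's node: a blocked TaskTracker HTTP port produces
    # exactly this MAX_FAILED_UNIQUE_FETCHES pattern.
    telnet slave1 50060
    telnet slave2 50060
    telnet slave3 50060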
If the job fails, check the following:
1. Is this problem common to all the reducers?
2. Are the map tasks the same across all the reducers for which the failure
is reported?
3. Is there at least one map task whose output is successfully fetched?
If the job eventually succeeds, then there might be some problem with the
reducer.
Amar
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>