Posted to common-user@hadoop.apache.org by xiaolin guo <xi...@hulu.com> on 2009/04/07 16:40:55 UTC

Too many fetch errors

I am trying to set up a small Hadoop cluster. Everything was fine before I
moved from a single-node cluster to a two-node cluster. I followed the article
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
to configure the master and slaves. However, when I tried to run the example
wordcount map-reduce application, the reduce task got stuck at 19% for a long
time. Then I got a notice: "INFO mapred.JobClient: TaskId :
attempt_200904072219_0001_m_000002_0, Status : FAILED too many fetch errors"
and an error message: "Error reading task outputslave".

All map tasks on both nodes had finished, which I could verify on the
TaskTracker pages.

Both nodes work well in single-node mode, and the Hadoop file system seems
to be healthy in multi-node mode.

Can anyone help me with this issue? I have been stuck on it for a long
time ...

Thanks very much!

Re: Too many fetch errors

Posted by xiaolin guo <xi...@hulu.com>.
Fixed the problem.
The problem was that one of the nodes could not resolve the hostname of the
other node. Even if I use IP addresses in the masters and slaves files,
Hadoop will use the hostname of the node instead of the IP address ...
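
For anyone who hits the same thing: the usual fix is to make sure every node
can resolve every other node's hostname, e.g. with matching /etc/hosts entries
on both machines. A minimal sketch with placeholder addresses and hostnames
(substitute your own):

    # /etc/hosts on BOTH nodes (placeholder IPs and hostnames)
    192.168.0.1    master
    192.168.0.2    slave

As far as I understand, TaskTrackers report themselves to the JobTracker by
hostname, and reducers build their fetch URLs from those reported names, which
is why putting raw IPs in the masters/slaves files is not enough.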


Re: Too many fetch errors

Posted by xiaolin guo <xi...@hulu.com>.
I have checked the log and found that for each map task there are 3
failures, which look like machine1 (failed) -> machine2 (failed) ->
machine1 (failed) -> machine2 (succeeded). All failures are "Too many fetch
failures". And I am sure there is no firewall between the two nodes; at
least port 50060 can be accessed from a web browser.

How can I check whether the two nodes can fetch mapper outputs from one
another? I have no idea how reducers fetch this data ...

Thanks!
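
One way to check: reducers pull map outputs over HTTP from each TaskTracker's
web server (port 50060 by default, mapred.task.tracker.http.address), so from
each node you can try resolving the other node's hostname and opening a
connection to that port. A minimal Python sketch, with placeholder hostnames
"master" and "slave" (substitute the names your TaskTrackers actually report):

    import socket

    PEERS = ["master", "slave"]   # placeholder hostnames
    TT_HTTP_PORT = 50060          # default TaskTracker HTTP port

    for host in PEERS:
        try:
            # Does DNS / /etc/hosts resolve the name at all?
            ip = socket.gethostbyname(host)
            print("%s resolves to %s" % (host, ip))
            # Can we actually reach the shuffle port on that name?
            conn = socket.create_connection((host, TT_HTTP_PORT), timeout=5)
            conn.close()
            print("%s:%d is reachable" % (host, TT_HTTP_PORT))
        except Exception as e:
            print("%s FAILED: %s" % (host, e))

If either the resolution step or the connect step fails on either machine,
the reduce-side fetches will fail the same way.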


Re: Too many fetch errors

Posted by Aaron Kimball <aa...@cloudera.com>.
Xiaolin,

Are you certain that the two nodes can fetch mapper outputs from one
another? If it's taking that long to complete, it might be the case that
what makes it "complete" is just that eventually it abandons one of your two
nodes and runs everything on a single node where it succeeds -- defeating
the point, of course.

Might there be a firewall between the two nodes that blocks the port used by
the reducer to fetch the mapper outputs? (I think this is on 50060 by
default.)
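
One quick way to test that: since the shuffle is plain HTTP against the
TaskTracker's embedded web server, try a request from the *other* node;
checking from a local browser only proves the port is open locally. A minimal
Python 3 sketch, assuming a node named "slave" and the usual 0.x-era status
page path (both are placeholders from memory, substitute your own):

    from urllib.request import urlopen

    # Run this on each node, pointed at the OTHER node's TaskTracker.
    # "slave" is a placeholder hostname; 50060 is the default port from
    # mapred.task.tracker.http.address.
    resp = urlopen("http://slave:50060/tasktracker.jsp", timeout=5)
    print(resp.status)  # 200 means the shuffle port is reachable from here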

- Aaron


Re: Too many fetch errors

Posted by xiaolin guo <xi...@hulu.com>.
This simple map-reduce application takes nearly 1 hour to finish running
on the two-node cluster, due to lots of Failed/Killed task attempts, while
on the single-node cluster it takes only 1 minute ... I am quite confused
about why there are so many Failed/Killed attempts ..
