Posted to common-user@hadoop.apache.org by Ming Yang <mi...@gmail.com> on 2007/10/19 05:05:18 UTC

Map task failure recovery

Hi,

In the original MapReduce paper from Google, it mentioned
that healthy workers can take over failed tasks from other
workers. Does Hadoop have the same failure recovery strategy?
Also the other question is, in the paper, it seems the nodes can
be added/removed while the cluster is running jobs. How does
Hadoop achieve this? The slave locations are saved in the slaves
file, and the master doesn't know about new nodes until it
restarts and reloads the slave list.

Thanks,

Ming Yang

RE: Simulated Nodes

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
I have done this earlier by manually starting more than one Datanode on the
same machine. Each Datanode instance has to have its own configuration
directory, and in each configuration the value of dfs.data.dir has to be
different from the others.
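
For example, a minimal sketch (the conf-dn2 directory name and the example
value are hypothetical; depending on the Hadoop version the second instance
may also need its own ports):

$ cp -r $HADOOP_HOME/conf $HADOOP_HOME/conf-dn2

then, in conf-dn2/hadoop-site.xml, point dfs.data.dir at a different local
path, e.g.

  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-dn2/data</value>
  </property>

and start the second Datanode against that configuration directory:

$ $HADOOP_HOME/bin/hadoop-daemon.sh --config $HADOOP_HOME/conf-dn2 start datanode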

Thanks,
dhruba

-----Original Message-----
From: Lance Amundsen [mailto:lca13@us.ibm.com] 
Sent: Friday, October 19, 2007 11:21 PM
To: hadoop-user@lucene.apache.org
Subject: Simulated Nodes

How can I get more "real", but simulated Hadoop nodes for testing, without
more machines?  Can I add duplicate names in the slaves file?  Or do I need
separate IP's for everything?

This is just for testing purposes.  Ideally, I'd like to simulate a few
nodes on my Windows laptop, to make it easier to debug actual distributed
stuff.

Maybe if I am clever with Cygwin, I could get several sessions
communicating with each other using virtual IP's, albeit all through
localhost.  But would I have issues with a single hadoop directory
structure if I did this?




Simulated Nodes

Posted by Lance Amundsen <lc...@us.ibm.com>.
How can I get more "real", but simulated Hadoop nodes for testing, without
more machines?  Can I add duplicate names in the slaves file?  Or do I need
separate IP's for everything?

This is just for testing purposes.  Ideally, I'd like to simulate a few
nodes on my Windows laptop, to make it easier to debug actual distributed
stuff.

Maybe if I am clever with Cygwin, I could get several sessions
communicating with each other using virtual IP's, albeit all through
localhost.  But would I have issues with a single hadoop directory
structure if I did this?




Re: Map task failure recovery

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Nguyen Manh Tien wrote:
> Owen, could you show me how to start additional data nodes or tasktrackers?
> 

On a new node that you want to be a part of the cluster, run:

$ $HADOOP_HOME/bin/hadoop-daemon.sh start datanode

or

$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
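
For the new daemons to actually join the running cluster, that node's
conf/hadoop-site.xml has to name the existing master; the hostnames and
ports below are only examples:

  <property><name>fs.default.name</name><value>master:9000</value></property>
  <property><name>mapred.job.tracker</name><value>master:9001</value></property>

Optionally, also list the node in conf/slaves on the master so that the
start-all.sh/stop-all.sh scripts include it the next time they run:

$ echo newnode >> $HADOOP_HOME/conf/slaves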

Arun

> 
> 2007/10/19, Owen O'Malley <oo...@yahoo-inc.com>:
> 
>>On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:
>>
>>
>>>In the original MapReduce paper from Google, it mentioned
>>>that healthy workers can take over failed tasks from other
>>>workers. Does Hadoop have the same failure recovery strategy?
>>
>>Yes. If a task fails on one node, it is assigned to another free node
>>automatically.
>>
>>
>>>Also the other question is, in the paper, it seems the nodes can
>>>be added/removed while the cluster is running jobs. How does
>>>Hadoop achieve this? The slave locations are saved in the slaves
>>>file, and the master doesn't know about new nodes until it
>>>restarts and reloads the slave list.
>>
>>The slaves file is only used by the startup scripts when bringing up
>>the cluster. If additional data nodes or task trackers (i.e. slaves)
>>are started, they automatically join the cluster and will be given
>>work. If the servers on one of the slaves are killed, the work will
>>be redone on other nodes.
>>
>>-- Owen
>>
>>
>>
> 
> 


Re: Map task failure recovery

Posted by Nguyen Manh Tien <ti...@gmail.com>.
Owen, could you show me how to start additional data nodes or tasktrackers?


2007/10/19, Owen O'Malley <oo...@yahoo-inc.com>:
>
> On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:
>
> > In the original MapReduce paper from Google, it mentioned
> > that healthy workers can take over failed tasks from other
> > workers. Does Hadoop have the same failure recovery strategy?
>
> Yes. If a task fails on one node, it is assigned to another free node
> automatically.
>
> > Also the other question is, in the paper, it seems the nodes can
> > be added/removed while the cluster is running jobs. How does
> > Hadoop achieve this? The slave locations are saved in the slaves
> > file, and the master doesn't know about new nodes until it
> > restarts and reloads the slave list.
>
> The slaves file is only used by the startup scripts when bringing up
> the cluster. If additional data nodes or task trackers (i.e. slaves)
> are started, they automatically join the cluster and will be given
> work. If the servers on one of the slaves are killed, the work will
> be redone on other nodes.
>
> -- Owen
>
>
>

Re: Map task failure recovery

Posted by Ming Yang <mi...@gmail.com>.
Does it happen when running Hadoop locally as a single process?
I was testing my code locally and didn't see failure recovery.
The other thing is, how is "failure" defined? If an uncaught
exception occurs in map(...) and the task fails, can this kind of
failure be recovered?

Thanks,

Ming

2007/10/18, Owen O'Malley <oo...@yahoo-inc.com>:
> On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:
>
> > In the original MapReduce paper from Google, it mentioned
> > that healthy workers can take over failed tasks from other
> > workers. Does Hadoop have the same failure recovery strategy?
>
> Yes. If a task fails on one node, it is assigned to another free node
> automatically.
>

Re: Map task failure recovery

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Oct 18, 2007, at 8:05 PM, Ming Yang wrote:

> In the original MapReduce paper from Google, it mentioned
> that healthy workers can take over failed tasks from other
> workers. Does Hadoop have the same failure recovery strategy?

Yes. If a task fails on one node, it is assigned to another free node  
automatically.
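
The retry is bounded by a per-task attempt limit, 4 by default; an uncaught
exception thrown by map() fails the attempt in the same way, and it is then
rescheduled, up to that limit. A hedged sketch of the setting, assuming a
release where the limit is exposed as mapred.map.max.attempts:

  <property>
    <name>mapred.map.max.attempts</name>
    <value>4</value>
  </property>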

> Also the other question is, in the paper, it seems the nodes can
> be added/removed while the cluster is running jobs. How does
> Hadoop achieve this? The slave locations are saved in the slaves
> file, and the master doesn't know about new nodes until it
> restarts and reloads the slave list.

The slaves file is only used by the startup scripts when bringing up
the cluster. If additional data nodes or task trackers (i.e. slaves)
are started, they automatically join the cluster and will be given
work. If the servers on one of the slaves are killed, the work will
be redone on other nodes.
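
Correspondingly, a node can be taken out of a running cluster just by
stopping its daemons; tasks that were running there are redone elsewhere.
For example, on that node:

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode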

-- Owen