Posted to common-user@hadoop.apache.org by Foss User <fo...@gmail.com> on 2009/04/05 11:48:26 UTC

After a node goes down, I can't run jobs

I have a Hadoop cluster of 5 nodes: (1) namenode, (2) job tracker, (3) first
slave, (4) second slave, (5) the client from which I submit jobs.

I brought system no. 4 down by running:

bin/hadoop-daemon.sh stop datanode
bin/hadoop-daemon.sh stop tasktracker

After this I tried running my word count job again and I got this error:

fossist@hadoop-client:~/mcr-wordcount$ hadoop jar dist/mcr-wordcount-0.1.jar com.fossist.examples.WordCountJob /fossist/inputs /fossist/output7
09/04/05 15:13:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/04/05 15:13:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
09/04/05 15:13:03 INFO hdfs.DFSClient: Abandoning block blk_-6478273736277251749_1034
09/04/05 15:13:09 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/04/05 15:13:09 INFO hdfs.DFSClient: Abandoning block blk_-7054779688981181941_1034
09/04/05 15:13:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/04/05 15:13:15 INFO hdfs.DFSClient: Abandoning block blk_-6231549606860519001_1034
09/04/05 15:13:21 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
09/04/05 15:13:21 INFO hdfs.DFSClient: Abandoning block blk_-7060117896593271410_1034
09/04/05 15:13:27 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

09/04/05 15:13:27 WARN hdfs.DFSClient: Error Recovery for block blk_-7060117896593271410_1034 bad datanode[1] nodes == null
09/04/05 15:13:27 WARN hdfs.DFSClient: Could not get block locations. Source file "/tmp/hadoop-hadoop/mapred/system/job_200904042051_0011/job.jar" - Aborting...
java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

Note that 192.168.1.5 is the Hadoop slave on which I stopped the datanode and
tasktracker. This is a serious concern for me: if I cannot run jobs after a
node goes down, the purpose of having a cluster is defeated.

Could someone help me understand whether this is an error on my part or a
problem in Hadoop? Is there any way to avoid it?

Note that I can still read all my data in the 'inputs' directory using
commands like:

fossist@hadoop-client:~/mcr-wordcount$ hadoop dfs -cat /fossist/inputs/input1.txt

Please help.

Re: After a node goes down, I can't run jobs

Posted by Bill Au <bi...@gmail.com>.
All the heartbeat and timeout intervals are configurable, so you don't need
to decommission a host explicitly. You can configure both the namenode and
the jobtracker to detect a failed datanode or tasktracker sooner. If you do
decommission a host, you will have to explicitly put it back into the
cluster afterwards.
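
For example, here is a minimal hadoop-site.xml sketch with illustrative
values only (heartbeat.recheck.interval is the namenode-side setting read by
FSNamesystem, and mapred.tasktracker.expiry.interval is, as far as I recall,
the jobtracker-side setting for declaring a tasktracker lost):

  <property>
    <name>heartbeat.recheck.interval</name>
    <value>60000</value> <!-- namenode recheck interval in ms; default is 300000 (5 min) -->
  </property>
  <property>
    <name>mapred.tasktracker.expiry.interval</name>
    <value>60000</value> <!-- jobtracker gives up on a silent tasktracker after this many ms; default 600000 -->
  </property>

With heartbeat.recheck.interval at 60000, the namenode's expiry works out to
2 * 60000 + 10 * 3000 = 150000 ms (2.5 minutes). Don't push these too low,
though, or briefly unreachable nodes will be declared dead and trigger
needless re-replication.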
Bill

Re: After a node goes down, I can't run jobs

Posted by jason hadoop <ja...@gmail.com>.
From the 0.19.0 FSNamesystem.java, it looks like the timeout by default is
2 * 300000 + 10 * 3000 = 630000 msec, or 10 minutes 30 seconds.
If you have configured dfs.hosts.exclude in your hadoop-site.xml to point to
a file that actually exists (it may start out empty), you may add the node's
name (as used in the slaves file) to that file and run

hadoop dfsadmin -refreshNodes

The namenode will then decommission that node.

    long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
    this.heartbeatRecheckInterval = conf.getInt(
        "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
    this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval + 10 * heartbeatInterval;
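
As a concrete sketch of the decommission route (the exclude-file path below
is just an example, not something Hadoop prescribes):

  In hadoop-site.xml:

    <property>
      <name>dfs.hosts.exclude</name>
      <value>/home/hadoop/conf/excludes</value> <!-- must exist; may start out empty -->
    </property>

  Then, to take the node out:

    echo "192.168.1.5" >> /home/hadoop/conf/excludes   # use whatever name appears in your slaves file
    hadoop dfsadmin -refreshNodes

To bring the node back later, remove its line from the exclude file, run
hadoop dfsadmin -refreshNodes again, and restart the datanode on that host.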


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: After a node goes down, I can't run jobs

Posted by Foss User <fo...@gmail.com>.
Here is an update. After waiting for some time (I don't know exactly how
long), the namenode web page on port 50070 showed the down node as a 'dead
node', and I was able to run jobs again as before. Does this mean that
Hadoop takes a while to accept that a node is dead?

Is this by design? In the first five minutes or so, while Hadoop is in
denial that a node is dead, all new jobs start failing. Is there a way I,
as a user, can tell Hadoop to start using the other available nodes during
this period?
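
(A side note, in case it is useful: the same information the 50070 page
shows can also be pulled from the command line, e.g.

  hadoop dfsadmin -report

which prints a per-datanode report, including when each datanode last
contacted the namenode.)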