Posted to common-user@hadoop.apache.org by Rajat Goel <ra...@gmail.com> on 2011/12/27 14:09:18 UTC

MapReduce job failing when a node of cluster is rebooted

Hi,

I have a 7-node setup (1 - Namenode/JobTracker, 6 - Datanodes/TaskTrackers)
running Hadoop version 0.20.203.

I performed the following test:
Initially, the cluster is running smoothly. About one or two minutes before
launching a MapReduce job, I shut down one of the data nodes (rebooted the
machine). My MapReduce job then starts but immediately fails with the
following messages on stderr:

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.
NOTICE: Configuration: /device.map    /region.map    /url.map
/data/output/2011/12/26/08
 PS:192.168.100.206:11111    3600    true    Notice
11/12/26 09:10:26 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
11/12/26 09:10:26 INFO input.FileInputFormat: Total input paths to process
: 24
11/12/26 09:10:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as
192.168.100.5:50010
11/12/26 09:10:37 INFO hdfs.DFSClient: Abandoning block
blk_-6309642664478517067_35619
11/12/26 09:10:37 INFO hdfs.DFSClient: Waiting to find target node:
192.168.100.7:50010
11/12/26 09:10:44 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.NoRouteToHostException: No route to host
11/12/26 09:10:44 INFO hdfs.DFSClient: Abandoning block
blk_4129088682008611797_35619
11/12/26 09:10:53 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as
192.168.100.5:50010
11/12/26 09:10:53 INFO hdfs.DFSClient: Abandoning block
blk_3596375242483863157_35619
11/12/26 09:11:01 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as
192.168.100.5:50010
11/12/26 09:11:01 INFO hdfs.DFSClient: Abandoning block
blk_724369205729364853_35619
11/12/26 09:11:07 WARN hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3002)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)

11/12/26 09:11:07 WARN hdfs.DFSClient: Error Recovery for block
blk_724369205729364853_35619 bad datanode[1] nodes == null
11/12/26 09:11:07 WARN hdfs.DFSClient: Could not get block locations.
Source file
"/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split"
- Aborting...
11/12/26 09:11:07 INFO mapred.JobClient: Cleaning up the staging area
hdfs://machine-100-205:9000/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292
Exception in thread "main" java.io.IOException: Bad connect ack with
firstBadLink as 192.168.100.5:50010
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)
11/12/26 09:11:07 ERROR hdfs.DFSClient: Exception closing file
/data/hadoop-admin/mapred/staging/admin/.staging/job_201112200923_0292/job.split
: java.io.IOException: Bad connect ack with firstBadLink as
192.168.100.5:50010
java.io.IOException: Bad connect ack with firstBadLink as
192.168.100.5:50010
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:3068)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2983)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)


- In the above logs, 192.168.100.5 is the machine I rebooted.
- The JobTracker's log file has no entries for the above time period.
- The NameNode's log file has no exceptions or messages related to the
above errors.
- All nodes can reach each other by IP and by hostname.
- The ulimit value for open files is set to 1024, but I don't see many
connections in CLOSE_WAIT state (I Googled a bit, and some people suggest
this value can be the culprit in some cases); a quick check is sketched
below.
- My Hadoop configuration files set the number of mappers (8), reducers
(4), and io.sort.mb (512 MB), roughly as in the snippet below. Most of the
other parameters are at their default values.
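
To check the open-file limit and the CLOSE_WAIT count, a minimal sketch
(assuming a standard Linux shell with net-tools installed):

    ulimit -n                          # open-file limit for the current shell/user
    netstat -tan | grep -c CLOSE_WAIT  # TCP connections stuck in CLOSE_WAIT

The mapper/reducer counts and io.sort.mb would sit in mapred-site.xml
roughly like this, assuming they refer to the per-TaskTracker slot counts
(0.20-era property names; values as mentioned above):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>512</value>
    </property>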

Can someone please provide pointers to a solution for this problem?

Thanks,
Rajat

MapReduce job failing when a node of cluster is rebooted

Posted by Rajat Goel <ra...@gmail.com>.
Posting the issue on this forum as well.

Regards,
Rajat

---------- Forwarded message ----------
From: Rajat Goel <ra...@gmail.com>
Date: Wed, Dec 28, 2011 at 12:04 PM
Subject: Re: MapReduce job failing when a node of cluster is rebooted
To: common-user@hadoop.apache.org


No, it's not connecting; it's out of the cluster. I am testing a
node-failure scenario, so I am not bothered about the node going down.

The issue here is that the job should succeed with the remaining nodes,
as the replication factor is > 1, but the job is failing.

Regards,
Rajat



Re: MapReduce job failing when a node of cluster is rebooted

Posted by Rajat Goel <ra...@gmail.com>.
No, it's not connecting; it's out of the cluster. I am testing a
node-failure scenario, so I am not bothered about the node going down.

The issue here is that the job should succeed with the remaining nodes,
as the replication factor is > 1, but the job is failing.
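
A quick way to confirm what the NameNode currently sees, assuming the
hadoop CLI on the client machine points at this cluster:

    hadoop dfsadmin -report   # live vs. dead datanodes, as the NameNode sees them
    hadoop fsck /             # overall replication health and under-replicated blocks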

Regards,
Rajat


Re: MapReduce job failing when a node of cluster is rebooted

Posted by alo alt <wg...@googlemail.com>.
Is the DN you've just rebooted connecting to the NN? Often the datanode
daemon isn't running; check it:
ps waux | grep "DataNode" | grep -v "grep"
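
If the JDK's jps tool is on the hadoop user's PATH, the same check should
also work as:
jps | grep -w DataNode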

- Alex




-- 
Alexander Lorenz
http://mapredit.blogspot.com


Re: MapReduce job failing when a node of cluster is rebooted

Posted by Rajat Goel <ra...@gmail.com>.
Yes, the HDFS- and MapReduce-related dirs are set outside of /tmp.


Re: MapReduce job failing when a node of cluster is rebooted

Posted by alo alt <wg...@googlemail.com>.
Hi,

Did you set the HDFS-related dirs outside of /tmp? Most *nix systems
clean them up on reboot.
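
For example, in hdfs-site.xml, something along these lines (the paths here
are only placeholders):

    <property>
      <name>dfs.name.dir</name>
      <value>/data/hadoop/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/data/hadoop/data</value>
    </property>

and similarly hadoop.tmp.dir in core-site.xml and mapred.local.dir in
mapred-site.xml.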

- Alex




-- 
Alexander Lorenz
http://mapredit.blogspot.com
