Posted to common-user@hadoop.apache.org by Jie Li <ji...@cs.duke.edu> on 2011/12/08 03:39:10 UTC

Why did one slave node die when the cluster was idle? Log info provided.

Hi all,

I have an 8-slave cluster on Amazon EC2. It was idle today, but I found
that one node had died. I couldn't figure out why. Below is some relevant
log info. Any input is appreciated.

jobtracker.log (only one entry)
2011-12-07 15:48:52,661 INFO org.apache.hadoop.mapred.JobTracker: Lost tracker 'tracker_ip-10-6-99-9.ec2.internal:localhost.localdomain/127.0.0.1:48901'

namenode.log
(one report every hour...)
2011-12-07 15:11:45,505 INFO org.apache.hadoop.hdfs.StateChange: *BLOCK* NameSystem.processReport: from 10.6.99.9:50010, blocks: 808, processing time: 1 msecs
2011-12-07 15:53:06,480 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 10.6.99.9:50010
2011-12-07 15:53:06,490 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.6.99.9:50010

dead node's tasktracker.log
(all earlier entries are like these; these are the last two)
2011-12-07 15:27:28,469 INFO org.apache.hadoop.mapred.UserLogCleaner: Deleting user log path job_201112032038_0348
2011-12-07 15:27:28,477 INFO org.apache.hadoop.mapred.UserLogCleaner: Deleting user log path job_201112032038_0350

dead node's datanode.log
(all earlier entries are like these; these are the last two)
2011-12-07 14:11:43,464 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 808 blocks took 82 msec to generate and 6 msecs for RPC and NN processing
2011-12-07 15:11:45,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 808 blocks took 122 msec to generate and 6 msecs for RPC and NN processing

Jie
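
The four log excerpts above can be lined up to narrow down when the node stopped responding. The following is only a rough sketch in Python: the timestamps are copied from the logs, but the two expiry timeouts are assumed values (10 minutes for the JobTracker, 10 minutes 30 seconds for the NameNode) and are not stated anywhere in this thread. Check them against your own cluster configuration before relying on the result. The helper names are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Hadoop log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the log excerpts in the original message.
EVENTS = {
    "datanode: last block report": "2011-12-07 15:11:45,517",
    "tasktracker: last log entry": "2011-12-07 15:27:28,477",
    "jobtracker: lost tracker": "2011-12-07 15:48:52,661",
    "namenode: lost heartbeat": "2011-12-07 15:53:06,480",
}

# ASSUMPTIONS, not taken from the thread: how long each master waits
# without a heartbeat before declaring the slave lost.
TT_EXPIRY = timedelta(minutes=10)
DN_EXPIRY = timedelta(minutes=10, seconds=30)


def parse(ts):
    """Parse a log timestamp such as '2011-12-07 15:48:52,661'."""
    return datetime.strptime(ts, FMT)


def latest_last_heartbeat(detected_at, expiry):
    """Latest time the final heartbeat could have arrived, given the
    time the loss was logged and the assumed expiry timeout."""
    return parse(detected_at) - expiry


if __name__ == "__main__":
    for name, ts in sorted(EVENTS.items(), key=lambda kv: parse(kv[1])):
        print(ts, name)
    print("TaskTracker silent no later than:",
          latest_last_heartbeat(EVENTS["jobtracker: lost tracker"], TT_EXPIRY))
    print("DataNode silent no later than:",
          latest_last_heartbeat(EVENTS["namenode: lost heartbeat"], DN_EXPIRY))
```

Because the masters only check for expired slaves periodically, these are upper bounds on the time of the last heartbeat, not exact times of death.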

Re: Why did one slave node die when the cluster was idle? Log info provided.

Posted by Harsh J <ha...@cloudera.com>.
Jie,

When you say a node died, do you mean that only Hadoop's services died, or that the whole node went down and came back up? (uptime may be able to tell you.)
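
One way to act on the uptime hint is to turn the node's uptime into a boot time and compare it with the last tasktracker log entry (15:27:28 on 2011-12-07). The sketch below is in Python and assumes a Linux node where /proc/uptime is readable and where the logs use the node's local clock. The function names are invented for this example.

```python
from datetime import datetime, timedelta


def boot_time(now, uptime_seconds):
    """Boot time implied by the current time and the uptime in seconds."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_after(now, uptime_seconds, last_log_entry):
    """True if the machine booted after the last log entry was written,
    which would suggest the whole node went down, not just Hadoop."""
    return boot_time(now, uptime_seconds) > last_log_entry


def read_uptime_seconds(path="/proc/uptime"):
    """Read the uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    # Last tasktracker log entry from the original message.
    last_entry = datetime(2011, 12, 7, 15, 27, 28)
    print(rebooted_after(datetime.now(), read_uptime_seconds(), last_entry))
```

If this prints False, the machine has been up since before the last log entry, so only the Hadoop daemons (or the network path to them) are likely to have failed.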
