Posted to common-user@hadoop.apache.org by Murali Krishna <mu...@yahoo-inc.com> on 2008/04/08 16:34:07 UTC

DFS behavior when the disk goes bad

Hi,

            We had a bad disk issue on one of the boxes and I am seeing
some strange behaviour. Just wanted to confirm whether this is
expected.


*	We are running a small cluster with 10 data nodes and a name
node
*	Each data node has 6 disks
*	While a job was running, one of the disks in one data node got
corrupted and the node got blacklisted
*	The job got killed because there was a space issue across the
entire cluster and it couldn't continue
*	When I then removed some unnecessary data, the disk usage
started coming down on all nodes except the node that got blacklisted
in the last job (is this expected?)
*	I restarted the entire cluster; after some time the disk usage
started coming down on the node with the corrupted disk and went very
low. Essentially, it has removed everything from that node. (Does DFS
remove the data from all disks on the node if one of the disks was
bad? And why didn't it do that before restarting?)


Thanks,

Murali


Re: DFS behavior when the disk goes bad

Posted by Amar Kamat <am...@yahoo-inc.com>.
Raghu Angadi wrote:
>
> The behavior seems correct.
> Assuming "blacklisted" means the NameNode marked this node 'dead':
I think the node is blacklisted by the JobTracker.
>
> Murali Krishna wrote:
>> *    We are running a small cluster with 10 data nodes and a name
>> node
>> *    Each data node has 6 disks
>> *    While a job was running, one of the disks in one data node got
>> corrupted and the node got blacklisted
>> *    The job got killed because there was a space issue across the
>> entire cluster and it couldn't continue
AFAIK a particular map task got killed on the bad/blacklisted node. All
subsequent re-executions of the map task got killed because 'mkdir'
failed, and hence the job failed. I think he is using Hadoop 0.15.3.
>> *    When I then removed some unnecessary data, the disk usage
>> started coming down on all nodes except the node that got blacklisted
>> in the last job (is this expected?)
>
> A DataNode will be marked dead if it does not heartbeat with the
> NameNode for 10 minutes or so. A bad disk could cause that (a thread
> trying to delete a block file from the bad disk might be hung, for
> example).
>
Murali, check the NameNode logs for all the entries concerning the bad
node. Just make sure that the NameNode knows that the node is bad.
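
A quick way to make that check, sketched in Java (a grep over the log
would do just as well); the log path and hostname below are
placeholders, not anything Hadoop-specific:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical helper: print every NameNode log line that mentions
    // the bad node. Path and hostname are placeholders for your setup.
    public class GrepNameNodeLog {
        public static void main(String[] args) throws IOException {
            String log = "/path/to/hadoop-namenode.log"; // placeholder
            String badNode = "datanode-07";              // placeholder
            try (BufferedReader in = new BufferedReader(new FileReader(log))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains(badNode)) {
                        System.out.println(line);
                    }
                }
            }
        }
    }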
Amar
> Once a datanode is marked dead, the NameNode does not interact with
> it.. so it did not remove any files from that node.
>
>> *    I restarted the entire cluster; after some time the disk usage
>> started coming down on the node with the corrupted disk and went very
>> low. Essentially, it has removed everything from that node. (Does DFS
>> remove the data from all disks on the node if one of the disks was
>> bad? And why didn't it do that before restarting?)
>
> By the time this node came back and was functioning well, the NameNode
> had already re-replicated the valid data that was sitting on it.. so
> when the node came up, the NameNode told the datanode to delete most of
> its blocks, since that data was either already deleted or re-replicated
> on other nodes.
>


Re: DFS behavior when the disk goes bad

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
The behavior seems correct.
Assuming "blacklisted" means the NameNode marked this node 'dead':

Murali Krishna wrote:
> *	We are running a small cluster with 10 data nodes and a name
> node
> *	Each data node has 6 disks
> *	While a job was running, one of the disks in one data node got
> corrupted and the node got blacklisted
> *	The job got killed because there was a space issue across the
> entire cluster and it couldn't continue
> *	When I then removed some unnecessary data, the disk usage
> started coming down on all nodes except the node that got blacklisted
> in the last job (is this expected?)

A DataNode will be marked dead if it does not heartbeat with the
NameNode for 10 minutes or so. A bad disk could cause that (a thread
trying to delete a block file from the bad disk might be hung, for
example).
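
For reference, the "10 minutes or so" comes from the NameNode's
heartbeat-expiry window. A minimal sketch of that arithmetic, assuming
the default recheck interval (5 minutes) and heartbeat interval (3
seconds); the constant names are illustrative, not the actual Hadoop
identifiers:

    // Sketch of the dead-node decision using the default intervals:
    // expiry window = 2 * recheck + 10 * heartbeat = 10.5 minutes.
    public class HeartbeatExpiry {
        static final long HEARTBEAT_MS = 3 * 1000L;      // default 3s
        static final long RECHECK_MS   = 5 * 60 * 1000L; // default 5min
        static final long EXPIRE_MS    = 2 * RECHECK_MS + 10 * HEARTBEAT_MS;

        static boolean isDead(long lastHeartbeatMs, long nowMs) {
            return nowMs - lastHeartbeatMs > EXPIRE_MS;
        }

        public static void main(String[] args) {
            System.out.println(EXPIRE_MS / 1000 + " s"); // prints 630 s
        }
    }

With those defaults the window works out to 630 seconds, i.e. roughly
the 10 minutes mentioned above.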

Once a datanode is marked dead, the NameNode does not interact with
it.. so it did not remove any files from that node.

> *	I restarted the entire cluster; after some time the disk usage
> started coming down on the node with the corrupted disk and went very
> low. Essentially, it has removed everything from that node. (Does DFS
> remove the data from all disks on the node if one of the disks was
> bad? And why didn't it do that before restarting?)

By the time this node came back and was functioning well, the NameNode
had already re-replicated the valid data that was sitting on it.. so
when the node came up, the NameNode told the datanode to delete most of
its blocks, since that data was either already deleted or re-replicated
on other nodes.
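
In other words, once the node rejoins and sends its block report, each
block it still holds is either over-replicated or no longer part of the
namespace, so the excess copies get scheduled for deletion. A minimal
sketch of that invariant, assuming a fixed replication factor of 3;
this is illustrative logic, not the actual NameNode code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Illustrative: how a NameNode-like process might handle a block
    // report from a node whose data was already re-replicated elsewhere.
    public class BlockReportSketch {
        static final int REPLICATION = 3; // assumed replication factor

        // liveReplicas maps block id -> replicas already live on other
        // nodes. A block that is gone from the namespace (null), or whose
        // copies elsewhere already meet the target, is invalidated here.
        static List<String> toInvalidate(List<String> reported,
                                         Map<String, Integer> liveReplicas) {
            List<String> excess = new ArrayList<>();
            for (String block : reported) {
                Integer live = liveReplicas.get(block);
                if (live == null || live >= REPLICATION) {
                    excess.add(block); // deleted file or surplus replica
                }
            }
            return excess;
        }
    }

That matches what Murali saw: nearly everything on the node was removed
after the restart, because its data had all been deleted or
re-replicated while the node was dead.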