Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/04/07 19:56:25 UTC

[jira] Commented: (HADOOP-124) Files still rotting in DFS of latest Hadoop

    [ http://issues.apache.org/jira/browse/HADOOP-124?page=comments#action_12373671 ] 

Doug Cutting commented on HADOOP-124:
-------------------------------------

Did you perhaps start identically configured datanodes as a different user?

Right now the only "lock" preventing this is the pid file used by the nutch-daemon.sh script.  Perhaps the datanode should lock each directory in dfs.data.dir?  That should prevent this, no?
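
A rough sketch of what such a per-directory lock could look like (the class name DataDirLock and the in_use.lock file name are just made up for illustration, not existing code): acquire an exclusive file lock in each dfs.data.dir at startup, and fail fast if another datanode process already holds it.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Illustrative sketch only: hold an exclusive lock file in each
// dfs.data.dir so a second datanode started against the same
// directory refuses to come up instead of double-reporting blocks.
public class DataDirLock {
  private final RandomAccessFile file;
  private final FileLock lock;

  private DataDirLock(RandomAccessFile file, FileLock lock) {
    this.file = file;
    this.lock = lock;
  }

  public static DataDirLock tryLock(File dataDir) throws IOException {
    File lockFile = new File(dataDir, "in_use.lock");  // assumed file name
    RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
    FileLock lock = raf.getChannel().tryLock();        // null if another process holds it
    if (lock == null) {
      raf.close();
      throw new IOException("Data directory already in use: " + dataDir);
    }
    return new DataDirLock(raf, lock);
  }

  public void release() throws IOException {
    lock.release();
    file.close();
  }
}

The datanode would call tryLock() once per configured data directory before registering with the namenode, and release() on shutdown.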

I suppose this could also happen if the datanode lost its connection to the namenode, but the namenode had not yet timed out the datanode.  Then the datanode would reconnect and blocks might be doubly-reported.  To fix this, perhaps the namenode should refuse to represent more than one copy of a block from a given IP?  If a second is reported, the first should be forgotten?
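
To make the "forget the first copy" idea concrete, something along these lines on the namenode side (purely illustrative names, not the actual namenode code): keep at most one datanode per IP for each block, so a later report from the same address replaces the earlier one rather than counting as an extra replica.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposed policy: for each block, remember
// at most one datanode per IP address; a later report from the same IP
// overwrites (and thus "forgets") the earlier registration.
public class BlockLocations {
  // blockId -> (datanode IP -> datanode identifier, e.g. "host:port")
  private final Map<Long, Map<String, String>> locations = new HashMap<Long, Map<String, String>>();

  public synchronized void blockReported(long blockId, String ip, String datanodeId) {
    Map<String, String> byIp = locations.get(blockId);
    if (byIp == null) {
      byIp = new HashMap<String, String>();
      locations.put(blockId, byIp);
    }
    byIp.put(ip, datanodeId);  // a second report from the same IP replaces the first
  }

  public synchronized int replicaCount(long blockId) {
    Map<String, String> byIp = locations.get(blockId);
    return byIp == null ? 0 : byIp.size();
  }
}

With replica counts computed this way, a stale connection that reconnects would no longer inflate the replication count and trigger a spurious block deletion.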



> Files still rotting in DFS of latest Hadoop
> -------------------------------------------
>
>          Key: HADOOP-124
>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>  Environment: ~30 node cluster
>     Reporter: Bryan Pendleton
>
> DFS files are still rotting.
> I suspect that there's a problem with block accounting/detecting identical hosts in the namenode. I have 30 physical nodes, with various numbers of local disks, meaning that my current 'bin/hadoop dfs -report' shows 80 nodes after a full restart. However, when I discovered the problem (which resulted in losing about 500gb worth of temporary data because of missing blocks in some of the larger chunks), -report showed 96 nodes. I suspect somehow there were extra datanodes running against the same paths, and that the namenode was counting those as replicated instances, which then showed up as over-replicated, so one of them was told to delete its local block, leading to the block actually getting lost.
> I will debug it more the next time the situation arises. This is at least the 5th time I've had a large amount of file data "rot" in DFS since January.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira