Posted to common-user@hadoop.apache.org by Chris Kline <ch...@rapleaf.com> on 2008/01/04 21:31:15 UTC

Under replicated block doesn't get fixed until DFS restart

fsck reports several under-replicated blocks, but these do not get
fixed until I restart DFS.  fsck also reports a missing block at the
same time, but that shouldn't affect the repair of the
under-replicated blocks.  Has anyone seen this before?

I'm running 0.15.0.
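
The fsck report is fairly verbose; a quick way to pull out just the
problem lines is to filter it.  A plain-Unix sketch, assuming the
stock output format shown later in this thread:

   $HADOOP_HOME/bin/hadoop fsck / | grep -E 'Under replicated|MISSING'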

-Chris Kline

Re: Under replicated block doesn't get fixed until DFS restart

Posted by Chris Kline <ch...@rapleaf.com>.
13 nodes.

On Jan 8, 2008, at 10:46 AM, Raghu Angadi wrote:

>
> The datanode log looks fine. There was an error while writing to
> mirrors when the data was first written, which can happen
> sometimes. It is still not clear why the namenode did not try to
> replicate these blocks until the next restart.
>
> How big is the cluster?
>
> Raghu.
> Chris Kline wrote:
>> Ah, yes, very interesting.  I get a "Read timed out" on  
>> 10.100.11.31, followed by a bunch of "Served block" messages...  
>> then finally a Transmitted block once HDFS was restarted.
>> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:08:17,057 INFO org.apache.hadoop.dfs.DataNode: Received block blk_4522585614366970680 from /10.100.11.31 and Read timed out
>> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:13:17,281 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
>> ------ a bunch of the above message
>> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 18:53:18,737 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
>> ------ HDFS restarted
>> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,637 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_4522585614366970680 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@7e9ffe3f
>> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,718 INFO org.apache.hadoop.dfs.DataNode: Transmitted block blk_4522585614366970680 to /10.100.11.59:7277
>

-Chris

We're hiring engineers.  $10,007 reward for referrals we hire.



Re: Under replicated block doesn't get fixed until DFS restart

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
The datanode log looks fine. There was an error while writing to mirrors
when the data was first written, which can happen sometimes. It is still
not clear why the namenode did not try to replicate these blocks until
the next restart.

How big is the cluster?

Raghu.
Chris Kline wrote:
> Ah, yes, very interesting.  I get a "Read timed out" on 10.100.11.31, 
> followed by a bunch of "Served block" messages... then finally a 
> Transmitted block once HDFS was restarted.
> 
> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:08:17,057 INFO org.apache.hadoop.dfs.DataNode: Received block blk_4522585614366970680 from /10.100.11.31 and Read timed out
> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:13:17,281 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
> ------ a bunch of the above message
> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 18:53:18,737 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
> ------ HDFS restarted
> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,637 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_4522585614366970680 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@7e9ffe3f
> hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,718 INFO org.apache.hadoop.dfs.DataNode: Transmitted block blk_4522585614366970680 to /10.100.11.59:7277


Re: Under replicated block doesn't get fixed until DFS restart

Posted by Chris Kline <ch...@rapleaf.com>.
Ah, yes, very interesting.  I get a "Read timed out" on 10.100.11.31,  
followed by a bunch of "Served block" messages... then finally a  
Transmitted block once HDFS was restarted.

hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:08:17,057 INFO org.apache.hadoop.dfs.DataNode: Received block blk_4522585614366970680 from /10.100.11.31 and Read timed out
hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 17:13:17,281 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
------ a bunch of the above message
hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-02:2008-01-02 18:53:18,737 INFO org.apache.hadoop.dfs.DataNode: Served block blk_4522585614366970680 to /10.100.11.31
------ HDFS restarted
hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,637 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_4522585614366970680 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@7e9ffe3f
hadoop-rapleaf-datanode-tf4.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,718 INFO org.apache.hadoop.dfs.DataNode: Transmitted block blk_4522585614366970680 to /10.100.11.59:7277

On Jan 7, 2008, at 11:06 PM, Raghu Angadi wrote:

> Hmm... one possibility is that the rest of the nodes were down, but
> the namenode showed other nodes were up. If more than one datanode
> was up, this indicates some bug.
>
> One last thing : grep for this block id at 10.100.11.31. You might  
> see some useful error message when the block was written.
>
> thanks,
> Raghu.
>
> Chris Kline wrote:
>> Sure.  Here are 2 blocks.
>> ----- blk_4522585614366970680
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,058 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/mapfiles/-1/index. blk_4522585614366970680
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,062 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.31:7277 is added to blk_4522585614366970680
>> ------ Note that the following is AFTER I restarted HDFS -------
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: UnderReplicationBlocks.update blk_4522585614366970680 curReplicas 1 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas  3 curPri  0 oldPri  3
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_4522585614366970680 has only 1 replicas and need 3 replicas so is added to neededReplications at priority level 0
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_4522585614366970680 from priority queue 0
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: blk_4522585614366970680 is removed from neededReplications to pendingReplications
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.100.11.31:7277 to replicate blk_4522585614366970680 to datanode(s) 10.100.11.59:7277 10.100.11.64:7277
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.59:7277
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.59:7277 is added to blk_4522585614366970680
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.64:7277
>> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.64:7277 is added to blk_4522585614366970680

-Chris

We're hiring engineers.  $10,007 reward for referrals we hire.



Re: Under replicated block doesn't get fixed until DFS restart

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Hmm... one possibility is that the rest of the nodes were down, but the
namenode showed other nodes were up. If more than one datanode was up,
this indicates some bug.

One last thing : grep for this block id at 10.100.11.31. You might see 
some useful error message when the block was written.

thanks,
Raghu.

Chris Kline wrote:
> Sure.  Here are 2 blocks.
> 
> ----- blk_4522585614366970680
> 
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,058 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/mapfiles/-1/index. blk_4522585614366970680
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,062 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.31:7277 is added to blk_4522585614366970680
>
> ------ Note that the following is AFTER I restarted HDFS -------
>
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: UnderReplicationBlocks.update blk_4522585614366970680 curReplicas 1 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas  3 curPri  0 oldPri  3
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_4522585614366970680 has only 1 replicas and need 3 replicas so is added to neededReplications at priority level 0
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_4522585614366970680 from priority queue 0
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: blk_4522585614366970680 is removed from neededReplications to pendingReplications
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.100.11.31:7277 to replicate blk_4522585614366970680 to datanode(s) 10.100.11.59:7277 10.100.11.64:7277
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.59:7277
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.59:7277 is added to blk_4522585614366970680
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.64:7277
> hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.64:7277 is added to blk_4522585614366970680
> 

Re: Under replicated block doesn't get fixed until DFS restart

Posted by Chris Kline <ch...@rapleaf.com>.
Sure.  Here are 2 blocks.

----- blk_4522585614366970680

hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,058 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/mapfiles/-1/index. blk_4522585614366970680
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:06:17,062 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.31:7277 is added to blk_4522585614366970680

------ Note that the following is AFTER I restarted HDFS -------

hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: UnderReplicationBlocks.update blk_4522585614366970680 curReplicas 1 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas  3 curPri  0 oldPri  3
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,882 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_4522585614366970680 has only 1 replicas and need 3 replicas so is added to neededReplications at priority level 0
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_4522585614366970680 from priority queue 0
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: blk_4522585614366970680 is removed from neededReplications to pendingReplications
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:58,808 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.100.11.31:7277 to replicate blk_4522585614366970680 to datanode(s) 10.100.11.59:7277 10.100.11.64:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.59:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,716 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.59:7277 is added to blk_4522585614366970680
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_4522585614366970680 is received from 10.100.11.64:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:59,717 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.64:7277 is added to blk_4522585614366970680

------ blk_1984980330938654629

hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:11:17,231 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/done. blk_1984980330938654629
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-02:2008-01-02 17:11:17,234 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.31:7277 is added to blk_1984980330938654629

------ Note that the following is AFTER I restarted HDFS -------

hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,856 DEBUG org.apache.hadoop.dfs.StateChange: UnderReplicationBlocks.update blk_1984980330938654629 curReplicas 1 curExpectedReplicas 3 oldReplicas 0 oldExpectedReplicas  3 curPri  0 oldPri  3
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:15,856 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.update:blk_1984980330938654629 has only 1 replicas and need 3 replicas so is added to neededReplications at priority level 0
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:55,805 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.UnderReplicationBlock.remove: Removing block blk_1984980330938654629 from priority queue 0
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:55,805 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: blk_1984980330938654629 is removed from neededReplications to pendingReplications
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:55,805 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.100.11.31:7277 to replicate blk_1984980330938654629 to datanode(s) 10.100.11.62:7277 10.100.11.61:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:56,660 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_1984980330938654629 is received from 10.100.11.62:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:56,660 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.62:7277 is added to blk_1984980330938654629
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:56,663 DEBUG org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.blockReceived: blk_1984980330938654629 is received from 10.100.11.61:7277
hadoop-rapleaf-namenode-tf1.rapleaf.com.log.2008-01-03:2008-01-03 16:17:56,663 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.11.61:7277 is added to blk_1984980330938654629
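
One note on reading the excerpts above: the DEBUG-level StateChange
entries only appear when that logger is turned up.  A sketch of the
log4j setting, using the logger name exactly as it appears in the log
lines (the file location is the usual conf/ default and may differ per
install):

   # conf/log4j.properties
   log4j.logger.org.apache.hadoop.dfs.StateChange=DEBUG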

On Jan 7, 2008, at 12:32 PM, Raghu Angadi wrote:

>
> Could you grep for one of the block ids (say 4522585614366970680)
> in the namenode logs and post them here with timestamps?
>
> thanks,
> Raghu.
>
> Chris Kline wrote:
>> I believe there was at least one good block (see fsck output).   
>> All data nodes were up at the time according to the web page.  I  
>> grep'd the namenode log files for the under replicated blocks and  
>> only got an entry for when it was created and entries for when the  
>> replication was fixed after the HDFS restart.  Here is the result  
>> of fsck:
>> $HADOOP_HOME/bin/hadoop fsck /
>> .......................................................
>> /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/done:  Under replicated blk_1984980330938654629. Target Replicas is 3 but found 1 replica(s).
>> ..
>> /data/hbase1/hregion_70236052/info/info/2807320534360768620:  Under replicated blk_1717622121416314549. Target Replicas is 3 but found 1 replica(s).
>> .
>> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/data:  Under replicated blk_-5019714262388221150. Target Replicas is 3 but found 1 replica(s).
>> .
>> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/index:  Under replicated blk_4522585614366970680. Target Replicas is 3 but found 1 replica(s).
>> .........................................
>> /data/hbase1/log_10.100.11.63_1199307142676_60020/hlog.dat.000:  Under replicated blk_-2871471426720379908. Target Replicas is 3 but found 1 replica(s).
>> .......
>> /data/hbase1/log_10.100.11.65_1199307142711_60020/hlog.dat.000: MISSING 1 blocks of total size 0 B.
>> .Status: CORRUPT
>>  Total size:    71009158262 B
>>  Total blocks:  16318 (avg. block size 4351584 B)
>>  Total dirs:    21416
>>  Total files:   16253
>>  Over-replicated blocks:        0 (0.0 %)
>>  Under-replicated blocks:       5 (0.03064101 %)
>>  Target replication factor:     3
>>  Real replication factor:       2.9993873
>> The filesystem under path '/' is CORRUPT
>> -Chris
>> On Jan 4, 2008, at 1:02 PM, Raghu Angadi wrote:
>>> This is of course not expected. A more detailed info or log  
>>> message would help. Do you know if there is at least one good  
>>> block? Sometimes, the remaining "good" block might actually be  
>>> corrupted and thus can not replicate itself. Restarting might  
>>> just have brought up the datanodes that were down (for whatever  
>>> reason) before the restart.
>>>
>>> Raghu.
>>>
>>> Chris Kline wrote:
>>>> fsck reports several under-replicated blocks, but these do not
>>>> get fixed until I restart DFS.  fsck also reports a missing
>>>> block at the same time, but that shouldn't affect the repair of
>>>> the under-replicated blocks.  Has anyone seen this before?
>>>> I'm running 0.15.0.
>>>> -Chris Kline
>>>
>> -Chris
>> We're hiring engineers.  $10,007 reward for referrals we hire.
>

-Chris

We're hiring engineers.  $10,007 reward for referrals we hire.



Re: Under replicated block doesn't get fixed until DFS restart

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Could you grep for one of the block ids (say 4522585614366970680) in
the namenode logs and post them here with timestamps?
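
For instance, something along these lines (the log directory is
illustrative; the filenames follow the pattern shown in Chris's
excerpts):

   grep 4522585614366970680 /path/to/hadoop/logs/hadoop-*-namenode-*.log*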

thanks,
Raghu.

Chris Kline wrote:
> I believe there was at least one good block (see fsck output).  All data 
> nodes were up at the time according to the web page.  I grep'd the 
> namenode log files for the under replicated blocks and only got an entry 
> for when it was created and entries for when the replication was fixed 
> after the HDFS restart.  Here is the result of fsck:
> 
> $HADOOP_HOME/bin/hadoop fsck /
> .......................................................
> /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/done:  Under replicated blk_1984980330938654629. Target Replicas is 3 but found 1 replica(s).
> ..
> /data/hbase1/hregion_70236052/info/info/2807320534360768620:  Under replicated blk_1717622121416314549. Target Replicas is 3 but found 1 replica(s).
> .
> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/data:  Under replicated blk_-5019714262388221150. Target Replicas is 3 but found 1 replica(s).
> .
> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/index:  Under replicated blk_4522585614366970680. Target Replicas is 3 but found 1 replica(s).
> .........................................
> /data/hbase1/log_10.100.11.63_1199307142676_60020/hlog.dat.000:  Under replicated blk_-2871471426720379908. Target Replicas is 3 but found 1 replica(s).
> .......
> /data/hbase1/log_10.100.11.65_1199307142711_60020/hlog.dat.000: MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
>  Total size:    71009158262 B
>  Total blocks:  16318 (avg. block size 4351584 B)
>  Total dirs:    21416
>  Total files:   16253
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       5 (0.03064101 %)
>  Target replication factor:     3
>  Real replication factor:       2.9993873
> 
> 
> The filesystem under path '/' is CORRUPT
> 
> 
> -Chris
> 
> On Jan 4, 2008, at 1:02 PM, Raghu Angadi wrote:
> 
>> This is of course not expected. A more detailed info or log message 
>> would help. Do you know if there is at least one good block? 
>> Sometimes, the remaining "good" block might actually be corrupted and 
>> thus can not replicate itself. Restarting might just have brought up 
>> the datanodes that were down (for whatever reason) before the restart.
>>
>> Raghu.
>>
>> Chris Kline wrote:
>>> fsck reports several under-replicated blocks, but these do not get
>>> fixed until I restart DFS.  fsck also reports a missing block at the
>>> same time, but that shouldn't affect the repair of the
>>> under-replicated blocks.  Has anyone seen this before?
>>> I'm running 0.15.0.
>>> -Chris Kline
>>
> 
> -Chris
> 
> We're hiring engineers.  $10,007 reward for referrals we hire.
> 
> 
> 


Re: Under replicated block doesn't get fixed until DFS restart

Posted by Ted Dunning <td...@veoh.com>.
The fsck output shows at least one file that doesn't have a replica.

I have seen situations where a block would not replicate.  It turned out to
be due to a downed node that had not yet been marked as down.  Once the
system finally realized the node was down, the fsck changed from reporting
low replication to reporting missing blocks.  I believe that the cause of
this was my messing around with datanodes trying to add storage on the fly
and then trying to move blocks around (unsuccessfully).

My only repair was to delete the files in question.  By design, this
isn't a big problem for me, since I don't yet quite trust hadoop with
mission-critical storage.
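
For what it's worth, fsck itself can do that kind of cleanup.  A
sketch, with an illustrative path, assuming the -move/-delete options
behave as in stock fsck (run hadoop fsck with no arguments to confirm
the exact set in your version):

   hadoop fsck /data -move     # move corrupt files to /lost+found
   hadoop fsck /data -delete   # or delete them outright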


On 1/7/08 10:59 AM, "Chris Kline" <ch...@rapleaf.com> wrote:

> I believe there was at least one good block (see fsck output).  All
> data nodes were up at the time according to the web page.  I grep'd
> the namenode log files for the under replicated blocks and only got
> an entry for when it was created and entries for when the replication
> was fixed after the HDFS restart.  Here is the result of fsck:
> 
> $HADOOP_HOME/bin/hadoop fsck /
> .......................................................
> /data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/done:  Under replicated blk_1984980330938654629. Target Replicas is 3 but found 1 replica(s).
> ..
> /data/hbase1/hregion_70236052/info/info/2807320534360768620:  Under replicated blk_1717622121416314549. Target Replicas is 3 but found 1 replica(s).
> .
> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/data:  Under replicated blk_-5019714262388221150. Target Replicas is 3 but found 1 replica(s).
> .
> /data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/index:  Under replicated blk_4522585614366970680. Target Replicas is 3 but found 1 replica(s).
> .........................................
> /data/hbase1/log_10.100.11.63_1199307142676_60020/hlog.dat.000:  Under replicated blk_-2871471426720379908. Target Replicas is 3 but found 1 replica(s).
> .......
> /data/hbase1/log_10.100.11.65_1199307142711_60020/hlog.dat.000: MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
>   Total size:    71009158262 B
>   Total blocks:  16318 (avg. block size 4351584 B)
>   Total dirs:    21416
>   Total files:   16253
>   Over-replicated blocks:        0 (0.0 %)
>   Under-replicated blocks:       5 (0.03064101 %)
>   Target replication factor:     3
>   Real replication factor:       2.9993873
> 
> 
> The filesystem under path '/' is CORRUPT
> 
> 
> -Chris
> 
> On Jan 4, 2008, at 1:02 PM, Raghu Angadi wrote:
> 
>> This is of course not expected. A more detailed info or log message
>> would help. Do you know if there is at least one good block?
>> Sometimes, the remaining "good" block might actually be corrupted
>> and thus can not replicate itself. Restarting might just have
>> brought up the datanodes that were down (for whatever reason)
>> before the restart.
>> 
>> Raghu.
>> 
>> Chris Kline wrote:
>>> fsck reports several under-replicated blocks, but these do not get
>>> fixed until I restart DFS.  fsck also reports a missing block at
>>> the same time, but that shouldn't affect the repair of the
>>> under-replicated blocks.  Has anyone seen this before?
>>> I'm running 0.15.0.
>>> -Chris Kline
>> 
> 
> -Chris
> 
> We're hiring engineers.  $10,007 reward for referrals we hire.
> 
> 


Re: Under replicated block doesn't get fixed until DFS restart

Posted by Chris Kline <ch...@rapleaf.com>.
I believe there was at least one good block (see fsck output).  All  
data nodes were up at the time according to the web page.  I grep'd  
the namenode log files for the under replicated blocks and only got  
an entry for when it was created and entries for when the replication  
was fixed after the HDFS restart.  Here is the result of fsck:

$HADOOP_HOME/bin/hadoop fsck /
.......................................................
/data/hbase1/hregion_70236052/compaction.dir/hregion_70236052/info/done:  Under replicated blk_1984980330938654629. Target Replicas is 3 but found 1 replica(s).
..
/data/hbase1/hregion_70236052/info/info/2807320534360768620:  Under replicated blk_1717622121416314549. Target Replicas is 3 but found 1 replica(s).
.
/data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/data:  Under replicated blk_-5019714262388221150. Target Replicas is 3 but found 1 replica(s).
.
/data/hbase1/hregion_70236052/info/mapfiles/2807320534360768620/index:  Under replicated blk_4522585614366970680. Target Replicas is 3 but found 1 replica(s).
.........................................
/data/hbase1/log_10.100.11.63_1199307142676_60020/hlog.dat.000:  Under replicated blk_-2871471426720379908. Target Replicas is 3 but found 1 replica(s).
.......
/data/hbase1/log_10.100.11.65_1199307142711_60020/hlog.dat.000: MISSING 1 blocks of total size 0 B.
.Status: CORRUPT
  Total size:    71009158262 B
  Total blocks:  16318 (avg. block size 4351584 B)
  Total dirs:    21416
  Total files:   16253
  Over-replicated blocks:        0 (0.0 %)
  Under-replicated blocks:       5 (0.03064101 %)
  Target replication factor:     3
  Real replication factor:       2.9993873


The filesystem under path '/' is CORRUPT


-Chris

On Jan 4, 2008, at 1:02 PM, Raghu Angadi wrote:

> This is of course not expected. A more detailed info or log message  
> would help. Do you know if there is at least one good block?  
> Sometimes, the remaining "good" block might actually be corrupted  
> and thus can not replicate itself. Restarting might just have  
> brought up the datanodes that were down (for whatever reason)  
> before the restart.
>
> Raghu.
>
> Chris Kline wrote:
>> fsck reports several under-replicated blocks, but these do not get
>> fixed until I restart DFS.  fsck also reports a missing block at
>> the same time, but that shouldn't affect the repair of the
>> under-replicated blocks.  Has anyone seen this before?
>> I'm running 0.15.0.
>> -Chris Kline
>

-Chris

We're hiring engineers.  $10,007 reward for referrals we hire.



Re: Under replicated block doesn't get fixed until DFS restart

Posted by Ted Dunning <td...@veoh.com>.
It can take a long time to decide that a node is down.  If that down node
has the last copy of a file, then it won't get replicated.

I run a balancing script every few hours.  It wanders through the files
and temporarily ups the replication of each file.  This is important
because the initial allocation of blocks isn't done as well as later
increases in replication.  It also causes the system to respond sooner
to files with a low replication count: if a datanode is down, the
remaining nodes will respond to the increased replication count, while
the down node won't respond to requests to delete the block.  This
results in a desirable improvement in replication for those nearly
orphaned blocks.
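
A minimal sketch of that kind of script, assuming the stock setrep
shell command; the path, factors, and sleep are illustrative, and the
real script walks files individually:

   #!/bin/sh
   # Temporarily over-replicate, then drop back.  Live datanodes create
   # the extra copies; a dead node can't act on the later request to
   # delete its excess replica, so the good copies end up on live nodes.
   hadoop dfs -setrep -R 4 /data   # bump above the normal factor of 3
   sleep 600                       # give the namenode time to schedule copies
   hadoop dfs -setrep -R 3 /data   # restore the normal factor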


On 1/4/08 1:02 PM, "Raghu Angadi" <ra...@yahoo-inc.com> wrote:

> This is of course not expected. A more detailed info or log message
> would help. Do you know if there is at least one good block? Sometimes,
> the remaining "good" block might actually be corrupted and thus can not
> replicate itself. Restarting might just have brought up the datanodes
> that were down (for whatever reason) before the restart.
> 
> Raghu.
> 
> Chris Kline wrote:
>> fsck reports several under-replicated blocks, but these do not get fixed
>> until I restart DFS.  fsck also reports a missing block at the same
>> time, but that shouldn't affect the repair of the under-replicated
>> blocks.  Has anyone seen this before?
>> 
>> I'm running 0.15.0.
>> 
>> -Chris Kline
> 


Re: Under replicated block doesn't get fixed until DFS restart

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
This is of course not expected. A more detailed info or log message 
would help. Do you know if there is at least one good block? Sometimes, 
the remaining "good" block might actually be corrupted and thus can not 
replicate itself. Restarting might just have brought up the datanodes 
that were down (for whatever reason) before the restart.

Raghu.

Chris Kline wrote:
> fsck reports several under-replicated blocks, but these do not get fixed
> until I restart DFS.  fsck also reports a missing block at the same
> time, but that shouldn't affect the repair of the under-replicated
> blocks.  Has anyone seen this before?
> 
> I'm running 0.15.0.
> 
> -Chris Kline