Posted to hdfs-dev@hadoop.apache.org by "Xiao Chen (JIRA)" <ji...@apache.org> on 2017/10/12 05:17:00 UTC

[jira] [Created] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker

Xiao Chen created HDFS-12642:
--------------------------------

             Summary: Log block and datanode details in BlockRecoveryWorker
                 Key: HDFS-12642
                 URL: https://issues.apache.org/jira/browse/HDFS-12642
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
            Reporter: Xiao Chen
            Assignee: Xiao Chen


In a recent investigation, we hit an odd block recovery issue that was difficult to reach a conclusion on because of insufficient logging.

For the most critical part of the events, we see that block recovery failed to {{commitBlockSynchronization}} on the NN because the block was not closed. This left the file open indefinitely (1+ months so far).
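
For context, this is the NN-facing call the primary DN makes at the end of block recovery (paraphrased from the 2.x {{DatanodeProtocol}}; treat the exact parameter list as approximate). With {{closeFile=true}} the NN is asked to both finalize the block and close the file, which is the step that kept failing here:

{noformat}
// Paraphrased from org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol (2.x line);
// exact parameter names may differ between releases.
void commitBlockSynchronization(
    ExtendedBlock block,          // the block under recovery (old genstamp)
    long newGenerationStamp,      // recovery id the surviving replicas were bumped to
    long newLength,               // agreed-upon length after recovery
    boolean closeFile,            // true => also close the file on the NN
    boolean deleteBlock,          // true => the block should be removed instead
    DatanodeID[] newTargets,      // DNs that hold the recovered replicas
    String[] newTargetStorages    // storage IDs on those DNs
) throws IOException;
{noformat}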

The block was not closed on the NN because {{dfs.namenode.replication.min}} is configured to 2, but only 1 replica had the latest genstamp.
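
A minimal sketch of the condition at play (names are illustrative, not the actual {{BlockManager}} code): the NN only counts a replica towards completing the last block if it carries the latest genstamp, and it needs at least {{dfs.namenode.replication.min}} such replicas before the file can be closed:

{noformat}
// Illustrative sketch only -- not the actual BlockManager implementation.
// A replica on an older genstamp is stale and does not count towards completion.
static boolean canCloseLastBlock(long latestGenStamp, long[] replicaGenStamps,
    int minReplication) {
  int live = 0;
  for (long gs : replicaGenStamps) {
    if (gs == latestGenStamp) {
      live++;
    }
  }
  return live >= minReplication;  // in this incident: minReplication = 2 but live = 1
}
{noformat}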

We were not able to tell why only 1 replica was on the latest genstamp.

From the primary node of the recovery (ps2204), {{initReplicaRecoveryImpl}} was called on each of the 7 DNs the block was ever placed on. All DNs except ps2204 and ps3765 failed the genstamp comparison - that's expected. ps2204 and ps3765 got past the comparison (their logs show no exceptions), but {{updateReplicaUnderRecovery}} appears to have been called only on ps3765.
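
The genstamp comparison that (expectedly) rejected the other five DNs is roughly the following guard in {{initReplicaRecoveryImpl}} (simplified; parameter names are assumptions, and the real method also handles replica states and the RUR transition):

{noformat}
// Simplified sketch of the genstamp guard -- not the exact FsDatasetImpl code.
static void checkReplicaGenStamp(long replicaGS, long blockGS, long recoveryId)
    throws IOException {
  if (replicaGS < blockGS) {
    // the replica never saw the block's current genstamp, so it cannot
    // participate in this recovery
    throw new IOException("replica genstamp " + replicaGS
        + " is older than block genstamp " + blockGS);
  }
  if (replicaGS >= recoveryId) {
    // the replica was already bumped by this or a newer recovery attempt
    throw new IOException("replica genstamp " + replicaGS
        + " is not older than recovery id " + recoveryId);
  }
}
{noformat}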

This jira proposes logging more details when {{BlockRecoveryWorker}} is about to call {{updateReplicaUnderRecovery}} on the DataNodes, so that cases like this can be figured out in the future (a rough sketch follows the log excerpt below).

{noformat}
$ grep "updateReplica:" ps2204.dn.log 
$ grep "updateReplica:" ps3765.dn.log 
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC Server handler 6 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica: BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312, recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery, blk_2172795728_1106150312, RUR
$ grep "initReplicaRecovery:" ps2204.dn.log 
hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR
hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RWR to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
$ grep "initReplicaRecovery:" ps3765.dn.log 
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW
hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RBW to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}}
{noformat}
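
As a starting point, the extra logging could look roughly like the fragment below, emitted once per participating DN just before the {{updateReplicaUnderRecovery}} call in {{BlockRecoveryWorker}} (names and placement are illustrative, not the final patch):

{noformat}
// Illustrative fragment -- exact names and placement to be decided in the patch.
for (BlockRecord r : participatingReplicas) {   // replicas that passed initReplicaRecovery
  LOG.info("syncBlock: block=" + block
      + ", recoveryId=" + recoveryId
      + ", newLength=" + newBlock.getNumBytes()
      + ", datanode=" + r
      + " - calling updateReplicaUnderRecovery");
  // ... existing updateReplicaUnderRecovery call ...
}
{noformat}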

P.S. HDFS-11499 was once suspected, but this was inconclusive since we don't have all the NN logs needed to know about the decommission.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
