Posted to user@hadoop.apache.org by Sumit Kumar <sk...@yahoo.com.INVALID> on 2016/05/04 00:29:09 UTC

Re: Reconfigured Namenode stuck in safemode

Hello All,


We've been experimenting with namenode reconfiguration in a ZooKeeper-based HA setup. We've been able to automate the initial setup using bigtop scripts, and we're trying to reuse the same scripts for reconfiguration should one of the namenodes die. I found that these scripts wait for the namenode to exit safe mode by issuing the following command:

hdfs dfsadmin -safemode wait
This works fine for the initial cluster setup; however, for reconfiguration the new namenode occasionally gets stuck in safe mode forever. For easier understanding, let's say we launched a cluster with 5 hosts, nn1 and nn2 were running fine, and at some point we replaced nn1 with nn3 (a completely new host). For this replacement to take effect we change the configuration on all the hosts to point to nn3 and restart the hadoop daemons there. We're seeing that nn2 comes back online fine, but nn3 remains stuck in safe mode forever.
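
For reference, a bounded version of that wait (what we'd likely switch to if we stop waiting indefinitely) would look something like this sketch; the 600-second cap and 10-second poll interval are arbitrary values I picked for illustration:

    #!/usr/bin/env bash
    # Poll safe mode status instead of blocking forever on "-safemode wait".
    deadline=$((SECONDS + 600))
    while [ "$SECONDS" -lt "$deadline" ]; do
      # "Safe mode is OFF" is the status line dfsadmin prints once the
      # namenode has checked in.
      if hdfs dfsadmin -safemode get | grep -q 'Safe mode is OFF'; then
        exit 0
      fi
      sleep 10
    done
    echo "namenode still in safe mode after 600s" >&2
    exit 1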
Running hdfs dfsadmin -safemode get shows exactly that: nn2 is fine (out of safe mode) and nn3 is in safe mode. If I run

hdfs dfsadmin -safemode leave

on nn3, it leaves safe mode immediately, and ls, cp, and mv on HDFS work just fine. We've been debating whether this is expected behavior and whether we should do one of the following:
   - don't do the safemode wait during reconfiguration
   - set dfs.namenode.safemode.threshold-pct to 0 for reconfiguration so the namenode would check in immediately (see the sketch below)
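
For the second option, the override would be a single property in nn3's hdfs-site.xml. A minimal sketch, assuming our templated configs inject the fragment inside the existing <configuration> element (the heredoc is just to display the fragment; the file handling is ours):

    # Sketch for option 2: render the threshold override into nn3's hdfs-site.xml.
    cat <<'EOF'
    <property>
      <name>dfs.namenode.safemode.threshold-pct</name>
      <value>0</value>
    </property>
    EOF
    # After restarting nn3, confirm the override took effect (expect 0):
    hdfs getconf -confKey dfs.namenode.safemode.threshold-pct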

It seems like we're doing something wrong here. I did read about HDFS edit logs: would nn3 sync all the edits from nn2 as it comes up? Do I need to worry about this during reconfiguration? Any recommendations on which logs we should look at, or on whether this approach is reasonable to automate? Would really appreciate any feedback.
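
If it matters, my understanding of the usual sequence for bringing up a replacement standby is roughly the following; whether our bigtop scripts do exactly this is part of what I'm unsure about, so treat the commands and ordering as my reconstruction, not our actual scripts:

    # Reconstructed sequence for bringing nn3 up in place of nn1 (illustrative).
    hdfs namenode -bootstrapStandby     # copy the latest fsimage checkpoint over from nn2
    hadoop-daemon.sh start namenode     # nn3 should then catch up on edits from shared storage
    hdfs dfsadmin -safemode wait        # ...and this is where we occasionally hang forever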
Thanks,
-Sumit