Posted to dev@ambari.apache.org by "Alejandro Fernandez (JIRA)" <ji...@apache.org> on 2015/06/06 02:24:00 UTC

[jira] [Commented] (AMBARI-11743) NameNode is forced to leave safemode, which causes HBase Master to crash if done too quickly

    [ https://issues.apache.org/jira/browse/AMBARI-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575449#comment-14575449 ] 

Alejandro Fernandez commented on AMBARI-11743:
----------------------------------------------

In Ambari 1.7.0, https://github.com/apache/ambari/blob/branch-1.7.0/ambari-server/src/main/resources/stacks/HDP/2.0.6/services/HDFS/package/scripts/hdfs_namenode.py
Ambari would *never* force NameNode to leave safemode.

In Ambari 2.0.0,
https://github.com/apache/ambari/blob/branch-2.0.0/ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 
Ambari would *force* NameNode to leave safemode under certain conditions; this arose from Rolling Upgrade (RU) requirements, but the code path was executed regardless of whether an RU was in progress.
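
For illustration, here is a minimal sketch of what that forced-leave path amounts to, written against the stock hdfs dfsadmin CLI rather than the actual resource_management code in hdfs_namenode.py; the user name and function names are assumptions, not the real script:

{code}
# Hypothetical sketch of the Ambari 2.0.0 "force leave safemode" behavior,
# using plain subprocess calls to the stock HDFS CLI.
import subprocess

def namenode_in_safemode(user="hdfs"):
    # "hdfs dfsadmin -safemode get" prints "Safe mode is ON" or "Safe mode is OFF"
    out = subprocess.check_output(
        ["sudo", "-u", user, "hdfs", "dfsadmin", "-safemode", "get"],
        universal_newlines=True)
    return "Safe mode is ON" in out

def force_leave_safemode(user="hdfs"):
    # This is the step AMBARI-11743 argues should not run unconditionally:
    # telling NameNode to exit safemode before block reports have arrived lets
    # clients (e.g. HBase Master) hit BlockMissingException on missing replicas.
    subprocess.check_call(
        ["sudo", "-u", user, "hdfs", "dfsadmin", "-safemode", "leave"])

if __name__ == "__main__":
    if namenode_in_safemode():
        force_leave_safemode()
{code}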

In Ambari 2.1.0,
https://github.com/apache/ambari/blob/branch-2.1/ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
The performance of HDFS commands improved, so Ambari now spends less time between starting NameNode, checking the safemode state, and forcing it to leave, which I believe is the change that is surfacing these latent issues.

Starting NameNode and waiting for safemode OFF should be independent of whether an RU is happening. However, RU runs HistoryServer start and the MR Service Check immediately after starting NameNode, and those two steps require NameNode to have safemode OFF.

In summary, I believe the fix is to wait longer for NameNode to reach safemode OFF, waiting up to 10 minutes (anything longer would cause the step to time out).
If NameNode is still in safemode after 10 minutes, it is up to the user to retry any subsequent steps. During RU, the user is allowed to retry HistoryServer start and the MR Service Check.
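
As a rough illustration of that approach, here is a minimal sketch of the wait-for-safemode-OFF loop; the helper names, polling interval, and standalone-script form are assumptions for clarity, not the actual change to hdfs_namenode.py:

{code}
# Hypothetical sketch: wait up to 10 minutes for safemode OFF instead of
# forcing NameNode out of safemode.
import subprocess
import time

def safemode_is_off(user="hdfs"):
    out = subprocess.check_output(
        ["sudo", "-u", user, "hdfs", "dfsadmin", "-safemode", "get"],
        universal_newlines=True)
    return "Safe mode is OFF" in out

def wait_for_safemode_off(timeout_secs=10 * 60, poll_secs=15):
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if safemode_is_off():
            return True
        time.sleep(poll_secs)
    # Still in safemode after ~10 minutes: do NOT force leave. The user can
    # retry the dependent steps (HistoryServer start, MR Service Check) later.
    return False
{code}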

> NameNode is forced to leave safemode, which causes HBase Master to crash if done too quickly
> --------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-11743
>                 URL: https://issues.apache.org/jira/browse/AMBARI-11743
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>
> 1. Install cluster with Ambari 2.1 and HDP 2.3
> 2. Add the HDFS, YARN, MR, ZK, and HBase services
> 3. Perform several Stop All and Start All operations on the HDFS service
> 4. Periodically, the HBase Master will crash
> This was a non-HA cluster.
> {code}
> 2015-06-02 09:34:24,865 WARN  [ip-172-31-33-225:16000.activeMasterManager] hdfs.DFSClient: Could not obtain block: BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 file=/apps/hbase/data/hbase.id No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
> 2015-06-02 09:34:24,866 WARN  [ip-172-31-33-225:16000.activeMasterManager] hdfs.DFSClient: DFS Read
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 file=/apps/hbase/data/hbase.id
> 	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:945)
> 	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:604)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:195)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.hbase.util.FSUtils.getClusterId(FSUtils.java:816)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:474)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:146)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:126)
> 	at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:649)
> 	at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
> 	at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-06-02 09:34:24,870 FATAL [ip-172-31-33-225:16000.activeMasterManager] master.HMaster: Failed to become active master
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-925466282-172.31.33.226-1433234647051:blk_1073741829_1005 file=/apps/hbase/data/hbase.id
> 	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:945)
> 	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:604)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:195)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.hbase.util.FSUtils.getClusterId(FSUtils.java:816)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:474)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:146)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:126)
> 	at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:649)
> 	at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:182)
> 	at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1646)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}


