Posted to dev@ambari.apache.org by "Siddharth Wagle (JIRA)" <ji...@apache.org> on 2013/09/27 23:02:03 UTC

[jira] [Commented] (AMBARI-3368) NameNode start hangs with HA config'd

    [ https://issues.apache.org/jira/browse/AMBARI-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780392#comment-13780392 ] 

Siddharth Wagle commented on AMBARI-3368:
-----------------------------------------

Upon further investigation, we found that the DFS client first tries to connect to the original NN and, only when that connection times out, tries the other NN.
This will slow down jobs that run after a failover.

{code}
[root@ambari-nn-ha-2 data]# time su - hdfs -c 'hadoop --config /etc/hadoop/conf fs -chown hcat /user/hcat'
13/09/24 14:09:48 DEBUG retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB. Trying to fail over immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1496)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1029)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3269)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(Na
{code}

Time:
{code}
real	0m3.996s
user	0m2.697s
sys	0m0.147s
{code}
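
The failover behaviour seen in the log above can be illustrated with a toy model. This is only a sketch, not Hadoop's client code: the classes `FakeNameNode` and `StandbyError`, the function `call_with_failover`, and the names `nn1`/`nn2` are all hypothetical, and it models only the StandbyException path shown in the trace, not a connection timeout. It assumes the client walks the NameNodes in a fixed configured order, so after a failover each new client spends one wasted attempt on the standby.

```python
class StandbyError(Exception):
    """Stands in for Hadoop's StandbyException in this toy model."""


class FakeNameNode:
    """Hypothetical NameNode stub; 'state' is 'active' or 'standby'."""

    def __init__(self, name, state):
        self.name = name
        self.state = state

    def get_file_info(self, path):
        # A standby rejects reads, as in the stack trace above.
        if self.state != "active":
            raise StandbyError(
                "Operation category READ is not supported in state standby")
        return {"path": path, "served_by": self.name}


def call_with_failover(namenodes, path):
    """Try each NameNode in configured order; return (result, attempts)."""
    attempts = 0
    for nn in namenodes:
        attempts += 1
        try:
            return nn.get_file_info(path), attempts
        except StandbyError:
            continue  # fail over to the next configured NameNode
    raise RuntimeError("no active NameNode found")


# After a failover the first configured NameNode is the standby,
# so the call only succeeds on the second attempt.
nns = [FakeNameNode("nn1", "standby"), FakeNameNode("nn2", "active")]
result, attempts = call_with_failover(nns, "/user/hcat")
print(result["served_by"], attempts)
```

In this model the extra attempt is cheap; in a real cluster its cost depends on how quickly the first NameNode rejects the call or the connection gives up.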
                
> NameNode start hangs with HA config'd
> -------------------------------------
>
>                 Key: AMBARI-3368
>                 URL: https://issues.apache.org/jira/browse/AMBARI-3368
>             Project: Ambari
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.4.1
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>             Fix For: 1.4.1
>
>
> After configuring NameNode HA, starting a NameNode hangs and then fails with "Puppet has been killed due to timeout".
> 1) Install cluster
> 2) Enable NameNode HA
> 3) Stop the standby NameNode on the Hosts details page
> 4) Stop the active NameNode on the Hosts details page
> 5) Start a NameNode on the Hosts details page
> 6) The start hangs at 35% complete. After ~10 minutes, Puppet is killed due to timeout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira