Posted to dev@ambari.apache.org by "Alejandro Fernandez (JIRA)" <ji...@apache.org> on 2015/01/15 23:17:34 UTC
[jira] [Created] (AMBARI-9163) Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Alejandro Fernandez created AMBARI-9163:
-------------------------------------------
Summary: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Key: AMBARI-9163
URL: https://issues.apache.org/jira/browse/AMBARI-9163
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.0.0
Reporter: Alejandro Fernandez
Assignee: Alejandro Fernandez
Priority: Blocker
Fix For: 2.0.0
The active NameNode shuts down during the first call to get the safe mode status.
{code}
su - hdfs -c 'hdfs dfsadmin -safemode get'
{code}
returned
{code}
failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
{code}
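Because the failure is intermittent (the NameNode may still be coming up when the check runs), one mitigation is to retry the safemode query instead of failing on the first connection refusal. A minimal sketch of that idea, not the actual Ambari code; the function name, retry count, and delay are illustrative:

```python
import time

def retry(fn, retries=5, delay=10, exceptions=(Exception,)):
    """Call fn() up to `retries` times, sleeping `delay` seconds between
    failed attempts; re-raise the last error if every attempt fails."""
    last_err = None
    for _ in range(retries):
        try:
            return fn()
        except exceptions as e:
            # e.g. "Connection refused" while the NameNode is still starting
            last_err = e
            time.sleep(delay)
    raise last_err

# Usage (illustrative): wrap the safemode check in the retry helper, e.g.
#   retry(lambda: subprocess.check_output(
#       ["su", "-", "hdfs", "-c", "hdfs dfsadmin -safemode get"]))
```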
The active NameNode log shows the following during the same time window:
{code}
2015-01-15 00:35:04,233 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.106:8485 failed to write txns 52-52. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy12.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
This issue is intermittent because it depends on the behavior of the JournalNodes, so fixing it will require more work in the scripts.
Today, our orchestration restarts one JournalNode at a time. However, the current log segment is null because the NameNode has not yet rolled to a new one; a roll can be forced with "hdfs dfsadmin -rollEdits", followed by waiting until certain conditions hold.
The runbook has more details:
{code}
// Function to ensure all JNs are up and functional
ensureJNsAreUp(Jn1, Jn2, Jn3) {
    rollEdits at the namenode        // hdfs dfsadmin -rollEdits
    get "LastAppliedOrWrittenTxId" from NN jmx
    wait till "LastWrittenTxId" from all JNs is >= the transaction ID from the previous step, timeout after 3 mins
}

// Before bringing down a journal node, ensure that the other two journal nodes are up
ensureJNsAreUp
for each JN {
    do upgrade of one JN
    ensureJNsAreUp
}
{code}
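The wait step in the runbook above can be sketched as a small polling helper. This is only an illustration of the comparison logic, not the actual Ambari script: the JMX fetchers are left as stubbed callables, and the metric names (LastAppliedOrWrittenTxId on the NameNode, LastWrittenTxId on each JournalNode) are taken from the runbook.

```python
import time

def jns_caught_up(nn_txid, jn_txids):
    """True once every JournalNode's LastWrittenTxId has reached the
    NameNode's LastAppliedOrWrittenTxId."""
    return all(txid >= nn_txid for txid in jn_txids)

def wait_for_jns(get_nn_txid, get_jn_txids, timeout=180, poll=5):
    """Poll until the JNs catch up or `timeout` seconds elapse.

    get_nn_txid / get_jn_txids are callables that would read the JMX
    values (e.g. from http://<host>:<port>/jmx); they are abstract here
    so the waiting logic can be shown on its own.
    """
    nn_txid = get_nn_txid()  # snapshot taken right after rollEdits
    deadline = time.time() + timeout
    while time.time() < deadline:
        if jns_caught_up(nn_txid, get_jn_txids()):
            return True
        time.sleep(poll)
    return False
```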
Root caused to:
https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java line 344
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)