Posted to dev@ambari.apache.org by Alejandro Fernandez <af...@hortonworks.com> on 2015/01/15 23:21:06 UTC
Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/
-----------------------------------------------------------
Review request for Ambari, Jonathan Hurley, Nate Cole, and Yurii Shylov.
Bugs: AMBARI-9163
https://issues.apache.org/jira/browse/AMBARI-9163
Repository: ambari
Description
-------
The active namenode shuts down during the first call to get the safemode status.
`
su - hdfs -c 'hdfs dfsadmin -safemode get'
`
returned
`
failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
`
The active namenode log shows the following during the same time window:
`
2015-01-15 00:35:04,233 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.64.106:8485 failed to write txns 52-52. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:470)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:344)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy12.journal(Unknown Source)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
`
This issue is intermittent because it depends on the behavior of the JournalNodes, so fixing it requires more work in the upgrade scripts.
Today, our orchestration restarts one JournalNode at a time. However, the current log segment can be null because the edit log has not yet rolled to a new segment; a roll can be forced with "hdfs dfsadmin -rollEdits", followed by waiting until certain conditions hold.
The runbook has more details:
`
// Function to ensure all JNs are up and functional
ensureJNsAreUp(Jn1, Jn2, Jn3) {
    rollEdits at the namenode  // hdfs dfsadmin -rollEdits
    get "LastAppliedOrWrittenTxId" from NN jmx
    wait till "LastWrittenTxId" from all JNs is >= the transaction ID from the previous step; timeout after 3 mins
}

// Before bringing down a journal node, ensure that the other two journal nodes are up
ensureJNsAreUp
for each JN {
    upgrade the JN
    ensureJNsAreUp
}
`
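As a rough illustration, here is a minimal Python 2 sketch of that runbook logic (Python 2 matches the era of these Ambari scripts). It is not the journalnode_upgrade.py from this patch: the JMX bean names and the journal_id parameter are assumptions; only the metric names come from the runbook.
`
# Hypothetical sketch of the runbook above, NOT the patch's journalnode_upgrade.py.
# The JMX bean names and the journal_id parameter are assumptions; only the
# metric names (LastAppliedOrWrittenTxId, LastWrittenTxId) come from the runbook.
import json
import subprocess
import time
import urllib2


def jmx_bean(host_port, query):
    # Query http://<host:port>/jmx and return the first matching bean.
    url = "http://%s/jmx?qry=%s" % (host_port, query)
    return json.load(urllib2.urlopen(url, timeout=30))["beans"][0]


def ensure_jns_are_up(nn_http_address, jn_http_addresses, journal_id, timeout_secs=180):
    # Roll the edit log so every JN opens a fresh segment.
    subprocess.check_call(["su", "-", "hdfs", "-c", "hdfs dfsadmin -rollEdits"])

    # LastAppliedOrWrittenTxId is nested in the JournalTransactionInfo JSON
    # attribute of the NameNode's NameNodeInfo bean (bean name assumed).
    nn = jmx_bean(nn_http_address, "Hadoop:service=NameNode,name=NameNodeInfo")
    target = int(json.loads(nn["JournalTransactionInfo"])["LastAppliedOrWrittenTxId"])

    pending = set(jn_http_addresses)
    deadline = time.time() + timeout_secs
    while pending:
        if time.time() > deadline:
            raise Exception("JNs %s did not reach txid %d in time" % (list(pending), target))
        for jn in list(pending):
            # Assumed JN bean name of the form Journal-<journalid>.
            bean = jmx_bean(jn, "Hadoop:service=JournalNode,name=Journal-" + journal_id)
            if int(bean["LastWrittenTxId"]) >= target:
                pending.discard(jn)
        if pending:
            time.sleep(5)
`
The upgrade loop would then call ensure_jns_are_up before restarting each JournalNode and again after it comes back.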
Root-caused to:
https://github.com/apache/hadoop/blob/ae91b13a4b1896b893268253104f935c3078d345/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java line 344
Diffs
-----
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml ce0ab297a8c8e665e8ffde79b9b36be2d29d117c
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode.py 15e068947307a321566385fb670232af7f78d71b
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py PRE-CREATION
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py 93efae35281e7d3d175ecc95b3af4e531cf69b64
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py f185ea0d6b2e7dfe1cd8ce95287d2a2f1970e682
Diff: https://reviews.apache.org/r/29950/diff/
Testing
-------
Copied the changed files to a 3-node HA cluster and verified that the upgrade worked twice.
Unit Tests are in progress.
Thanks,
Alejandro Fernandez
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68333
-----------------------------------------------------------
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml
<https://reviews.apache.org/r/29950/#comment112514>
Will have to test that this works on a brand-new install.
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
<https://reviews.apache.org/r/29950/#comment112515>
Other values are HTTP_ONLY and HTTP_AND_HTTPS
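For reference, a hedged sketch of defaulting that property (dfs.http.policy is the standard HDFS key; the hdfs_site dict below is a stand-in for the cluster config):
`
# Hypothetical sketch: defaulting dfs.http.policy when it is unset.
# Valid HDFS values are HTTP_ONLY, HTTPS_ONLY, and HTTP_AND_HTTPS.
hdfs_site = {}  # stand-in for the cluster's hdfs-site configuration
policy = hdfs_site.get("dfs.http.policy", "HTTP_ONLY").upper()
use_https = policy in ("HTTPS_ONLY", "HTTP_AND_HTTPS")
`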
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
<https://reviews.apache.org/r/29950/#comment112516>
I may make these default params in the function signature.
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
<https://reviews.apache.org/r/29950/#comment112517>
May take a while for the namenode switch to happen after killing ZKFC.
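A hedged sketch of the kind of wait this implies; hdfs haadmin -getServiceState is the standard CLI, while the nn ids below are illustrative:
`
# Hypothetical sketch: poll until some NameNode reports active after ZKFC dies.
import subprocess
import time

def wait_for_active_nn(nn_ids=("nn1", "nn2"), timeout_secs=180):
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        for nn_id in nn_ids:
            try:
                out = subprocess.check_output(
                    ["su", "-", "hdfs", "-c", "hdfs haadmin -getServiceState " + nn_id])
            except subprocess.CalledProcessError:
                continue  # that NN (or its RPC) is down; try the next one
            if out.strip() == "active":
                return nn_id
        time.sleep(5)
    raise Exception("no NameNode became active within %d seconds" % timeout_secs)
`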
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
<https://reviews.apache.org/r/29950/#comment112518>
Should maybe check nn_address.lower().startswith(...).
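Presumably something like this, where the prefix tuple is my assumption about the elided argument:
`
# Hypothetical sketch of the suggested case-insensitive scheme check.
nn_address = "HTTP://c6408.ambari.apache.org:50070"  # example value
if not nn_address.lower().startswith(("http://", "https://")):
    nn_address = "http://" + nn_address
`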
- Alejandro Fernandez
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Nate Cole <nc...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68338
-----------------------------------------------------------
Ship it!
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/metainfo.xml
<https://reviews.apache.org/r/29950/#comment112521>
Not sure if we really need this, since the 2.2 HDFS metainfo package of hadoop_2_2_* already includes the client.
- Nate Cole
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Tom Beerbower <tb...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68355
-----------------------------------------------------------
Ship It!
- Tom Beerbower
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Alejandro Fernandez <af...@hortonworks.com>.
> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py, line 50
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line50>
> >
> > If not specified, this should be defaulted to HTTP_ONLY
Will fix this.
> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py, lines 55-56
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line55>
> >
> > This will not work in HA mode. The NameNode is a combination of `dfs.namenode.http-address`, the HA cluster name, and the `nn` identifier. Such as:
> >
> > dfs.namenode.http-address.c1ha.nn2
With the current code, it returns a value like "c6408.ambari.apache.org:50070", and the function get_jmx_data converts it to something like "http://c6408.ambari.apache.org:50070/jmx", which does appear to work.
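Roughly, the conversion described here amounts to the following sketch (the patch's actual get_jmx_data may differ):
`
# Hypothetical sketch of the address-to-JMX-URL conversion described above.
import json
import urllib2

def get_jmx_data(nn_address, query):
    if not nn_address.lower().startswith(("http://", "https://")):
        nn_address = "http://" + nn_address  # e.g. c6408.ambari.apache.org:50070
    url = nn_address.rstrip("/") + "/jmx?qry=" + query
    return json.load(urllib2.urlopen(url, timeout=30))["beans"][0]
`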
> On Jan. 16, 2015, 12:07 a.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py, lines 87-88
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line87>
> >
> > kinit needed here?
kinit happens just before, on line 83.
- Alejandro
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68366
-----------------------------------------------------------
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Jonathan Hurley <jh...@hortonworks.com>.
> On Jan. 15, 2015, 7:07 p.m., Jonathan Hurley wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py, lines 55-56
> > <https://reviews.apache.org/r/29950/diff/1/?file=823094#file823094line55>
> >
> > This will not work in HA mode. The NameNode is a combination of `dfs.namenode.http-address`, the HA cluster name, and the `nn` identifier. Such as:
> >
> > dfs.namenode.http-address.c1ha.nn2
>
> Alejandro Fernandez wrote:
> With the current code, it returns a value like "c6408.ambari.apache.org:50070"
> And the function get_jmx_data will convert it to something like "http://c6408.ambari.apache.org:50070/jmx", which does appear to work
I still think that this is an issue. Consider the following: in my cluster, I have
`hdfs-site/dfs.namenode.http-address` as `c6401.ambari.apache.org:50070`
`hdfs-site/dfs.namenode.http-address.c1ha.nn1` as `c6401.ambari.apache.org:50071`
My NameNode is on 50071, not 50070. We need to open a Jira to track this. Can you open one up?
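For illustration, a sketch of the HA-aware lookup being described; the helper is hypothetical, but the property scheme (dfs.nameservices, dfs.ha.namenodes.<ns>, dfs.namenode.http-address.<ns>.<nn-id>) is the standard HDFS one:
`
# Hypothetical sketch: resolve NameNode HTTP addresses in an HA cluster.
def namenode_http_addresses(hdfs_site):
    nameservice = hdfs_site.get("dfs.nameservices")
    if not nameservice:
        # Non-HA: the plain key is authoritative.
        return [hdfs_site["dfs.namenode.http-address"]]
    nn_ids = hdfs_site["dfs.ha.namenodes.%s" % nameservice].split(",")
    return [hdfs_site["dfs.namenode.http-address.%s.%s" % (nameservice, nn_id.strip())]
            for nn_id in nn_ids]
`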
- Jonathan
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68366
-----------------------------------------------------------
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Jonathan Hurley <jh...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/#review68366
-----------------------------------------------------------
Some issues with NN HA mode. Also, I am unable to apply this patch to my trunk HEAD in order to test it; please refresh the patch and I'll try it out.
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
<https://reviews.apache.org/r/29950/#comment112546>
If not specified, this should be defaulted to HTTP_ONLY
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
<https://reviews.apache.org/r/29950/#comment112549>
This will not work in HA mode. The NameNode is a combination of `dfs.namenode.http-address`, the HA cluster name, and the `nn` identifier. Such as:
dfs.namenode.http-address.c1ha.nn2
ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/journalnode_upgrade.py
<https://reviews.apache.org/r/29950/#comment112550>
kinit needed here?
- Jonathan Hurley
Re: Review Request 29950: Intermittent Preparing NAMENODE fails during RU due to JOURNALNODE quorum not established
Posted by Alejandro Fernandez <af...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29950/
-----------------------------------------------------------
(Updated Jan. 15, 2015, 10:43 p.m.)
Review request for Ambari, Dmitro Lisnichenko, Jonathan Hurley, Nate Cole, Srimanth Gunturi, Sid Wagle, Tom Beerbower, and Yurii Shylov.
Bugs: AMBARI-9163
https://issues.apache.org/jira/browse/AMBARI-9163
Repository: ambari
Testing (updated)
-------
Copied the changed files to a 3-node HA cluster and verified that the upgrade worked twice.
Unit Tests passed:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30:23.410s
[INFO] Finished at: Thu Jan 15 14:43:23 PST 2015
[INFO] Final Memory: 61M/393M
[INFO] ------------------------------------------------------------------------
Thanks,
Alejandro Fernandez