Posted to hdfs-user@hadoop.apache.org by Suresh V <ve...@gmail.com> on 2015/08/10 03:12:46 UTC
Active Namenode keeps crashing
In our HA setup, the active NameNode keeps crashing about once a week. The
cluster is mostly idle, with few jobs running and little user activity.
Below are logs from the journal nodes. Can someone help us with this, please?
2015-08-04 13:00:20,054 INFO server.Journal
(Journal.java:updateLastPromisedEpoch(315)) - Updating lastPromisedEpoch
from 9 to 10 for client /172.26.44.133
2015-08-04 13:00:20,175 INFO server.Journal
(Journal.java:scanStorageForLatestEdits(188)) - Scanning storage
FileJournalManager(root=/hadoop/hdfs/journal/HDPPROD)
2015-08-04 13:00:20,220 INFO server.Journal
(Journal.java:scanStorageForLatestEdits(194)) - Latest log is
EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false)
2015-08-04 13:00:20,891 INFO server.Journal
(Journal.java:getSegmentInfo(687)) - getSegmentInfo(523903):
EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false)
-> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,891 INFO server.Journal
(Journal.java:prepareRecovery(731)) - Prepared recovery for segment 523903:
segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true }
lastWriterEpoch: 9 lastCommittedTxId: 523924
2015-08-04 13:00:20,956 INFO server.Journal
(Journal.java:getSegmentInfo(687)) - getSegmentInfo(523903):
EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false)
-> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,956 INFO server.Journal
(Journal.java:acceptRecovery(817)) - Skipping download of log startTxId:
523903 endTxId: 523925 isInProgress: true: already have up-to-date logs
2015-08-04 13:00:20,989 INFO server.Journal
(Journal.java:acceptRecovery(850)) - Accepted recovery for segment 523903:
segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true }
acceptedInEpoch: 10
2015-08-04 13:00:21,791 INFO server.Journal
(Journal.java:finalizeLogSegment(584)) - Validating log segment
/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903
about to be finalized
2015-08-04 13:00:21,805 INFO namenode.FileJournalManager
(FileJournalManager.java:finalizeLogSegment(133)) - Finalizing edits file
/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903
->
/hadoop/hdfs/journal/HDPPROD/current/edits_0000000000000523903-0000000000000523925
2015-08-04 13:00:22,257 INFO server.Journal
(Journal.java:startLogSegment(532)) - Updating lastWriterEpoch from 9 to 10
for client /172.26.44.133
2015-08-04 13:00:23,699 INFO ipc.Server (Server.java:run(2060)) - IPC
Server handler 4 on 8485, call
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from
172.26.44.135:43678 Call#304302 Retry#0
java.io.IOException: IPC's epoch 9 is less than the last promised epoch 10
at
org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
at
org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
at
org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
at
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-08-06 19:13:14,012 INFO httpclient.HttpMethodDirector
(HttpMethodDirector.java:executeWithRetry(439)) - I/O exception
(org.apache.commons.httpclient.NoHttpResponseException) caught when
processing request: The server az-easthdpmnp02.metclouduseast.com failed to
respond
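For context on the IOException above: "IPC's epoch 9 is less than the last promised epoch 10" is the quorum-journal fencing check. Once a JournalNode has promised a newer epoch to a NameNode taking over as writer, it rejects edits from the previous writer, which then aborts. So this trace usually means a failover (or failover attempt) happened around 13:00:20. A rough shell sketch for pulling epoch transitions out of a JournalNode log, with the line format taken from the excerpt above (the log path in the comment is a placeholder; adjust for your install):

```shell
#!/usr/bin/env bash
# Sketch: list epoch transitions recorded by a JournalNode, so failover
# events can be lined up with NameNode crash times.

list_epoch_changes() {
  # Prints "TIMESTAMP epochName old -> new" for each epoch update line.
  grep -E 'Updating last(Promised|Writer)Epoch' "$1" |
    sed -E 's/^([0-9-]+ [0-9:,]+).*- Updating (last[A-Za-z]+Epoch) from ([0-9]+) to ([0-9]+).*/\1 \2 \3 -> \4/'
}

# Example (placeholder path):
# list_epoch_changes /var/log/hadoop/hdfs/hadoop-hdfs-journalnode.log
```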
Thank you
Suresh.
Re: Active Namenode keeps crashing
Posted by Artem Ervits <ar...@gmail.com>.
Check whether connectivity between the servers is stable; the error says it
can't reach one node. Also check that time is synchronized between the nodes.
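To make those checks concrete, here is a rough sketch assuming three JournalNodes on the default QJM RPC port 8485 (the port seen in the log); the hostnames below are placeholders, substitute your own:

```shell
#!/usr/bin/env bash
# Sketch: basic reachability checks for a QJM setup.
# Hostnames are placeholders; 8485 is the default JournalNode RPC port.
JOURNAL_NODES="jn1.example.com jn2.example.com jn3.example.com"
JN_PORT=8485

# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

for h in $JOURNAL_NODES; do
  if port_open "$h" "$JN_PORT"; then
    echo "$h:$JN_PORT reachable"
  else
    echo "$h:$JN_PORT NOT reachable"
  fi
done

# For clock drift: with ntpd, 'ntpq -p' shows the offset per peer on each
# node; offsets should stay well under a second across the cluster.
```

Run this from each NameNode host so you catch one-directional network problems as well.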