Posted to common-user@hadoop.apache.org by Shaik M <mu...@gmail.com> on 2016/04/28 13:32:44 UTC

NameNode Crashing with "flush failed for required journal" exception

Hi All,

I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
Kerberos security.

The NameNode is configured for HA, and it is crashing at least once a day with a
"flush failed for required journal" exception. There are no network issues
between the nodes.

I have tried to find the cause of the issue, but I could not find a
proper resolution. Please help me fix this issue.

Thank you,
Shaik

2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
(QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for a
response for sendEdits. Succeeded so far: [10.192.149.194:8485]
2016-04-28 05:05:23,483 INFO  BlockStateChange
(BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
neededReplications = 0, pendingReplications = 0.
2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
(QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for a
response for sendEdits. Succeeded so far: [10.192.149.194:8485]
2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
(JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for
required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
starting at txid 26198626))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
respond.
        at
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at
org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at
org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3492)
        at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:787)
        at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
        at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
(QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting
at txid 26198626
2016-04-28 05:05:25,150 INFO  util.ExitUtil (ExitUtil.java:terminate(124))
- Exiting with status 1
2016-04-28 05:05:25,160 INFO  namenode.NameNode (LogAdapter.java:info(47))
- SHUTDOWN_MSG:

Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Shaik M <mu...@gmail.com>.
Hi Chris,

After installing the "nscd" service on the Hadoop cluster, the NameNode has been
running stable without any downtime for the last three days. :)

Thank you for your help.

Regards,
Shaik



On 29 April 2016 at 11:43, Shaik M <mu...@gmail.com> wrote:

> Thank you for your suggestions.
>
> I found in logs
> "WARN  security.Groups (Groups.java:fetchGroupList(244)) - Potential
> performance problem: getGroups(user=hdfs) took 15915 milliseconds.
>
> First I'll deploy "nscd" service on all three journal nodes and will
> update you accordingly.
>
> Thanks,
> Shaik
>
> On 29 April 2016 at 02:08, Chris Nauroth <cn...@hortonworks.com> wrote:
>
>> A problem I've seen a few times is that slow lookups of the hdfs user's
>> groups at the JournalNode introduce delays in handling the edit logging
>> RPC, which then times out at the NameNode side, ultimately causing an
>> abort and an HA failover.  If your environment is experiencing this, then
>> you'll see messages in the JournalNode logs about "Potential performance
>> problem: getGroups".  If this is happening, then there are several
>> potential fixes.
>>
>> 1. Ultimately, root cause is a performance problem in the infrastructure's
>> ability to lookup the groups for a user.  This warrants investigation into
>> whatever that infrastructure is.  (i.e. PAM/LDAP integration with
>> something like ActiveDirectory is common in a lot of IT shops.)  It's
>> extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
>> Service Cache Daemon) to improve performance of group lookups and reduce
>> load on the infrastructure in this kind of deployment.
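
[Editor's note: as an illustrative sketch, not part of the original message, group-lookup caching in nscd is enabled in /etc/nscd.conf roughly as below. The TTL values are example choices, not recommendations from this thread; tune them for your environment.]

```
# /etc/nscd.conf -- group-lookup caching (illustrative values)
enable-cache            group   yes
positive-time-to-live   group   3600    # cache successful lookups for 1 hour
negative-time-to-live   group   60      # cache failed lookups briefly
persistent              group   yes     # keep the cache across nscd restarts
shared                  group   yes     # let clients read the cache directly
```
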
>>
>> 2. A potential workaround is to use Hadoop's static group mapping feature
>> to define the hdfs user's list of groups in configuration.  This way, the
>> group lookup of the hdfs user performed by the JournalNode never hits the
>> group lookup infrastructure at all.  The downside is that managing group
>> memberships in Hadoop configuration files is much more cumbersome than
>> managing it externally.  For more information, see the documentation of
>> the configuration property hadoop.user.group.static.mapping.overrides in
>> core-default.xml. [1]
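
[Editor's note: a hedged sketch of option 2. The group list shown is hypothetical; set it to the actual groups of your hdfs user. Per core-default.xml, the value uses a user=group1,group2 syntax, and users not listed fall through to the normal lookup mechanism.]

```xml
<!-- core-site.xml (illustrative): pin the hdfs user's groups statically,
     so the JournalNode never consults the group-lookup infrastructure -->
<property>
  <name>hadoop.user.group.static.mapping.overrides</name>
  <value>hdfs=hdfs,hadoop</value>
</property>
```
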
>>
>> 3. Another potential workaround is to increase the timeouts allowed for
>> the JournalNode RPC calls.  I haven't had as much success with this
>> myself, but it's possible.  For more information on how to configure this,
>> see the documentation of the various dfs.qjournal.*.timeout settings in
>> hdfs-default.xml. [2]
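
[Editor's note: for option 3, the setting that matches the 20000 ms timeout in the log above is dfs.qjournal.write-txns.timeout.ms. The override below is illustrative; verify the property name and default against the hdfs-default.xml for your Hadoop version.]

```xml
<!-- hdfs-site.xml (illustrative): raise the quorum edit-write timeout -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value> <!-- default is 20000 ms -->
</property>
```
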
>>
>> --Chris Nauroth
>>
>> [1] https://s.apache.org/kX8D
>> [2] https://s.apache.org/LzJd
>>
>>
>>
>> On 4/28/16, 7:32 AM, "Gagan Brahmi" <ga...@gmail.com> wrote:
>>
>> >Hi Shaik,
>> >
>> >The error basically indicates that namenode crashed waiting for the
>> >write and sync to happen on the quorum of JournalNodes. In your case
>> >atleast 2 journal nodes should complete the write and sync without the
>> >timeout period of 20 seconds which does not seems to be the case.
>> >
>> >I will advice you to verify the journal node logs and you should find
>> >something interesting on them. Maybe some reasons for failures to
>> >complete the write and sync operation on journal nodes.
>> >
>> >
>> >Regards,
>> >Gagan Brahmi
>> >
>> >On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <mu...@gmail.com> wrote:
>> >> Hi All,
>> >>
>> >> I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
>> >> Kerberos security.
>> >>
>> >> NameNode having  HA and it is crashing at least once in a day with
>> >>"flush
>> >> failed for required journal " exception. don't have any network issues
>> >> between the nodes.
>> >>
>> >> I have tried to find the causing the issue,  but, i couldn't able to
>> >>found
>> >> proper resolution. Please help me to fix this issue.
>> >>
>> >> Thank you,
>> >> Shaik
>> >>
>> >> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
>> >> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for
>> >>a
>> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> >> 2016-04-28 05:05:23,483 INFO  BlockStateChange
>> >> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
>> >> neededReplications = 0, pendingReplications = 0.
>> >> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
>> >> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for
>> >>a
>> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> >> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
>> >> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
>> >>for
>> >> required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
>> >> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
>> >> starting at txid 26198626))
>> >> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
>> >> respond.
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(
>> >>AsyncLoggerSet.java:137)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(Qu
>> >>orumOutputStream.java:107)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>> >>utputStream.java:113)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>> >>utputStream.java:107)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$
>> >>8.apply(JournalSet.java:533)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErr
>> >>ors(JournalSet.java:393)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.j
>> >>ava:57)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.
>> >>flush(JournalSet.java:529)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:6
>> >>47)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesy
>> >>stem.java:3492)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNod
>> >>eRpcServer.java:787)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTransla
>> >>torPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Client
>> >>NamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Pr
>> >>otobufRpcEngine.java:616)
>> >>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
>> >>         at
>> >>
>>
>> >>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>> >>.java:1657)
>> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
>> >> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
>> >> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
>> >>starting
>> >> at txid 26198626
>> >> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
>> >>(ExitUtil.java:terminate(124)) -
>> >> Exiting with status 1
>> >> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
>> >>(LogAdapter.java:info(47)) -
>> >> SHUTDOWN_MSG:
>> >>
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>> >For additional commands, e-mail: user-help@hadoop.apache.org
>> >
>> >
>>
>>
>

Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Shaik M <mu...@gmail.com>.
Thank you for your suggestions.

I found the following in the logs:
"WARN  security.Groups (Groups.java:fetchGroupList(244)) - Potential
performance problem: getGroups(user=hdfs) took 15915 milliseconds."

First, I'll deploy the "nscd" service on all three journal nodes and will update
you accordingly.

Thanks,
Shaik

On 29 April 2016 at 02:08, Chris Nauroth <cn...@hortonworks.com> wrote:

> A problem I've seen a few times is that slow lookups of the hdfs user's
> groups at the JournalNode introduce delays in handling the edit logging
> RPC, which then times out at the NameNode side, ultimately causing an
> abort and an HA failover.  If your environment is experiencing this, then
> you'll see messages in the JournalNode logs about "Potential performance
> problem: getGroups".  If this is happening, then there are several
> potential fixes.
>
> 1. Ultimately, root cause is a performance problem in the infrastructure's
> ability to lookup the groups for a user.  This warrants investigation into
> whatever that infrastructure is.  (i.e. PAM/LDAP integration with
> something like ActiveDirectory is common in a lot of IT shops.)  It's
> extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
> Service Cache Daemon) to improve performance of group lookups and reduce
> load on the infrastructure in this kind of deployment.
>
> 2. A potential workaround is to use Hadoop's static group mapping feature
> to define the hdfs user's list of groups in configuration.  This way, the
> group lookup of the hdfs user performed by the JournalNode never hits the
> group lookup infrastructure at all.  The downside is that managing group
> memberships in Hadoop configuration files is much more cumbersome than
> managing it externally.  For more information, see the documentation of
> the configuration property hadoop.user.group.static.mapping.overrides in
> core-default.xml. [1]
>
> 3. Another potential workaround is to increase the timeouts allowed for
> the JournalNode RPC calls.  I haven't had as much success with this
> myself, but it's possible.  For more information on how to configure this,
> see the documentation of the various dfs.qjournal.*.timeout settings in
> hdfs-default.xml. [2]
>
> --Chris Nauroth
>
> [1] https://s.apache.org/kX8D
> [2] https://s.apache.org/LzJd
>
>
>
> On 4/28/16, 7:32 AM, "Gagan Brahmi" <ga...@gmail.com> wrote:
>
> >Hi Shaik,
> >
> >The error basically indicates that namenode crashed waiting for the
> >write and sync to happen on the quorum of JournalNodes. In your case
> >atleast 2 journal nodes should complete the write and sync without the
> >timeout period of 20 seconds which does not seems to be the case.
> >
> >I will advice you to verify the journal node logs and you should find
> >something interesting on them. Maybe some reasons for failures to
> >complete the write and sync operation on journal nodes.
> >
> >
> >Regards,
> >Gagan Brahmi
> >
> >On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <mu...@gmail.com> wrote:
> >> Hi All,
> >>
> >> I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
> >> Kerberos security.
> >>
> >> NameNode having  HA and it is crashing at least once in a day with
> >>"flush
> >> failed for required journal " exception. don't have any network issues
> >> between the nodes.
> >>
> >> I have tried to find the causing the issue,  but, i couldn't able to
> >>found
> >> proper resolution. Please help me to fix this issue.
> >>
> >> Thank you,
> >> Shaik
> >>
> >> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
> >> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for
> >>a
> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
> >> 2016-04-28 05:05:23,483 INFO  BlockStateChange
> >> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
> >> neededReplications = 0, pendingReplications = 0.
> >> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
> >> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for
> >>a
> >> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
> >> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
> >> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
> >>for
> >> required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
> >> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
> >> starting at txid 26198626))
> >> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
> >> respond.
> >>         at
> >>
> >>org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(
> >>AsyncLoggerSet.java:137)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(Qu
> >>orumOutputStream.java:107)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
> >>utputStream.java:113)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
> >>utputStream.java:107)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$
> >>8.apply(JournalSet.java:533)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErr
> >>ors(JournalSet.java:393)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.j
> >>ava:57)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.
> >>flush(JournalSet.java:529)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:6
> >>47)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesy
> >>stem.java:3492)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNod
> >>eRpcServer.java:787)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTransla
> >>torPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
> >>         at
> >>
> >>org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Client
> >>NamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >>         at
> >>
> >>org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Pr
> >>otobufRpcEngine.java:616)
> >>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> >>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
> >>         at
> >>
> >>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
> >>.java:1657)
> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> >> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
> >> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
> >>starting
> >> at txid 26198626
> >> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
> >>(ExitUtil.java:terminate(124)) -
> >> Exiting with status 1
> >> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
> >>(LogAdapter.java:info(47)) -
> >> SHUTDOWN_MSG:
> >>
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> >For additional commands, e-mail: user-help@hadoop.apache.org
> >
> >
>
>


> >>         at
> >>
> >>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
> >>.java:1657)
> >>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> >> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
> >> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
> >>starting
> >> at txid 26198626
> >> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
> >>(ExitUtil.java:terminate(124)) -
> >> Exiting with status 1
> >> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
> >>(LogAdapter.java:info(47)) -
> >> SHUTDOWN_MSG:
> >>
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> >For additional commands, e-mail: user-help@hadoop.apache.org
> >
> >
>
>

Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Shaik M <mu...@gmail.com>.
Thank you for your suggestions.

I found the following warning in the JournalNode logs:
"WARN  security.Groups (Groups.java:fetchGroupList(244)) - Potential
performance problem: getGroups(user=hdfs) took 15915 milliseconds."

First, I'll deploy the "nscd" service on all three journal nodes and will
update you accordingly.

Thanks,
Shaik

On 29 April 2016 at 02:08, Chris Nauroth <cn...@hortonworks.com> wrote:

> A problem I've seen a few times is that slow lookups of the hdfs user's
> groups at the JournalNode introduce delays in handling the edit logging
> RPC, which then times out at the NameNode side, ultimately causing an
> abort and an HA failover.  If your environment is experiencing this, then
> you'll see messages in the JournalNode logs about "Potential performance
> problem: getGroups".  If this is happening, then there are several
> potential fixes.
>
> 1. Ultimately, the root cause is a performance problem in the
> infrastructure's ability to look up the groups for a user.  This warrants
> investigation into whatever that infrastructure is.  (e.g. PAM/LDAP
> integration with something like Active Directory is common in a lot of IT
> shops.)  It's extremely helpful for the nodes of a Hadoop cluster to run
> nscd (Name Service Cache Daemon) to improve performance of group lookups
> and reduce load on the infrastructure in this kind of deployment.
>
> 2. A potential workaround is to use Hadoop's static group mapping feature
> to define the hdfs user's list of groups in configuration.  This way, the
> group lookup of the hdfs user performed by the JournalNode never hits the
> group lookup infrastructure at all.  The downside is that managing group
> memberships in Hadoop configuration files is much more cumbersome than
> managing it externally.  For more information, see the documentation of
> the configuration property hadoop.user.group.static.mapping.overrides in
> core-default.xml. [1]
>
> 3. Another potential workaround is to increase the timeouts allowed for
> the JournalNode RPC calls.  I haven't had as much success with this
> myself, but it's possible.  For more information on how to configure this,
> see the documentation of the various dfs.qjournal.*.timeout settings in
> hdfs-default.xml. [2]
>
> --Chris Nauroth
>
> [1] https://s.apache.org/kX8D
> [2] https://s.apache.org/LzJd
>
>
>

Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Chris Nauroth <cn...@hortonworks.com>.
A problem I've seen a few times is that slow lookups of the hdfs user's
groups at the JournalNode introduce delays in handling the edit logging
RPC, which then times out at the NameNode side, ultimately causing an
abort and an HA failover.  If your environment is experiencing this, then
you'll see messages in the JournalNode logs about "Potential performance
problem: getGroups".  If this is happening, then there are several
potential fixes.

1. Ultimately, the root cause is a performance problem in the
infrastructure's ability to look up the groups for a user.  This warrants
investigation into whatever that infrastructure is.  (e.g. PAM/LDAP
integration with something like Active Directory is common in a lot of IT
shops.)  It's extremely helpful for the nodes of a Hadoop cluster to run
nscd (Name Service Cache Daemon) to improve performance of group lookups
and reduce load on the infrastructure in this kind of deployment.
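
As a quick check, you can time a group lookup on a JournalNode host before
and after enabling nscd. This is only a sketch; the account name "hdfs"
below is an assumption, so substitute the user your JournalNode runs as:

```shell
# Time a single group lookup for the hdfs service account (assumed name).
# On a healthy host this completes in a few milliseconds; results in the
# seconds range point at the group lookup infrastructure (e.g. LDAP).
start=$(date +%s%N)
id -Gn hdfs >/dev/null 2>&1
end=$(date +%s%N)
echo "group lookup took $(( (end - start) / 1000000 )) ms"
```

If the reported time is anywhere near the 20-second journal timeout, group
lookups are almost certainly the problem.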

2. A potential workaround is to use Hadoop's static group mapping feature
to define the hdfs user's list of groups in configuration.  This way, the
group lookup of the hdfs user performed by the JournalNode never hits the
group lookup infrastructure at all.  The downside is that managing group
memberships in Hadoop configuration files is much more cumbersome than
managing it externally.  For more information, see the documentation of
the configuration property hadoop.user.group.static.mapping.overrides in
core-default.xml. [1]
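
For reference, a static mapping entry in core-site.xml might look like the
following sketch; the group list ("hdfs,hadoop") is only an illustration,
so substitute the actual group memberships of your hdfs user:

```xml
<property>
  <name>hadoop.user.group.static.mapping.overrides</name>
  <!-- Format: user1=group1,group2;user2=group1 (groups here are illustrative) -->
  <value>hdfs=hdfs,hadoop</value>
</property>
```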

3. Another potential workaround is to increase the timeouts allowed for
the JournalNode RPC calls.  I haven't had as much success with this
myself, but it's possible.  For more information on how to configure this,
see the documentation of the various dfs.qjournal.*.timeout settings in
hdfs-default.xml. [2]
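
The 20000 ms timeout in the stack trace corresponds to the edit-write
timeout. If you go this route, a sketch of the change in hdfs-site.xml
might look like this (the 60-second value is only an example, and the
NameNodes must be restarted for it to take effect):

```xml
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <!-- Default is 20000 (20 s); example value only -->
  <value>60000</value>
</property>
```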

--Chris Nauroth

[1] https://s.apache.org/kX8D
[2] https://s.apache.org/LzJd



On 4/28/16, 7:32 AM, "Gagan Brahmi" <ga...@gmail.com> wrote:

>Hi Shaik,
>
>The error basically indicates that the NameNode crashed while waiting for
>the write and sync to happen on the quorum of JournalNodes. In your case
>at least 2 journal nodes should complete the write and sync within the
>timeout period of 20 seconds, which does not seem to be the case.
>
>I would advise you to check the journal node logs; you should find
>something interesting in them, perhaps the reasons why the write and
>sync operations on the journal nodes are failing to complete.
>
>
>Regards,
>Gagan Brahmi
>


Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Chris Nauroth <cn...@hortonworks.com>.
A problem I've seen a few times is that slow lookups of the hdfs user's
groups at the JournalNode introduce delays in handling the edit logging
RPC, which then times out at the NameNode side, ultimately causing an
abort and an HA failover.  If your environment is experiencing this, then
you'll see messages in the JournalNode logs about "Potential performance
problem: getGroups".  If this is happening, then there are several
potential fixes.

1. Ultimately, root cause is a performance problem in the infrastructure's
ability to lookup the groups for a user.  This warrants investigation into
whatever that infrastructure is.  (i.e. PAM/LDAP integration with
something like ActiveDirectory is common in a lot of IT shops.)  It's
extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
Service Cache Daemon) to improve performance of group lookups and reduce
load on the infrastructure in this kind of deployment.

2. A potential workaround is to use Hadoop's static group mapping feature
to define the hdfs user's list of groups in configuration.  This way, the
group lookup of the hdfs user performed by the JournalNode never hits the
group lookup infrastructure at all.  The downside is that managing group
memberships in Hadoop configuration files is much more cumbersome than
managing it externally.  For more information, see the documentation of
the configuration property hadoop.user.group.static.mapping.overrides in
core-default.xml. [1]

3. Another potential workaround is to increase the timeouts allowed for
the JournalNode RPC calls.  I haven't had as much success with this
myself, but it's possible.  For more information on how to configure this,
see the documentation of the various dfs.qjournal.*.timeout settings in
hdfs-default.xml. [2]

--Chris Nauroth

[1] https://s.apache.org/kX8D
[2] https://s.apache.org/LzJd



On 4/28/16, 7:32 AM, "Gagan Brahmi" <ga...@gmail.com> wrote:

>Hi Shaik,
>
>The error basically indicates that namenode crashed waiting for the
>write and sync to happen on the quorum of JournalNodes. In your case
>atleast 2 journal nodes should complete the write and sync without the
>timeout period of 20 seconds which does not seems to be the case.
>
>I will advice you to verify the journal node logs and you should find
>something interesting on them. Maybe some reasons for failures to
>complete the write and sync operation on journal nodes.
>
>
>Regards,
>Gagan Brahmi
>
>On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <mu...@gmail.com> wrote:
>> Hi All,
>>
>> I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
>> Kerberos security.
>>
>> NameNode having  HA and it is crashing at least once in a day with
>>"flush
>> failed for required journal " exception. don't have any network issues
>> between the nodes.
>>
>> I have tried to find the causing the issue,  but, i couldn't able to
>>found
>> proper resolution. Please help me to fix this issue.
>>
>> Thank you,
>> Shaik
>>
>> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for
>>a
>> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> 2016-04-28 05:05:23,483 INFO  BlockStateChange
>> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
>> neededReplications = 0, pendingReplications = 0.
>> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for
>>a
>> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
>> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
>>for
>> required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
>> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
>> starting at txid 26198626))
>> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
>> respond.
>>         at
>> 
>>org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(
>>AsyncLoggerSet.java:137)
>>         at
>> 
>>org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(Qu
>>orumOutputStream.java:107)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>>utputStream.java:113)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>>utputStream.java:107)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$
>>8.apply(JournalSet.java:533)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErr
>>ors(JournalSet.java:393)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.j
>>ava:57)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.
>>flush(JournalSet.java:529)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:6
>>47)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesy
>>stem.java:3492)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNod
>>eRpcServer.java:787)
>>         at
>> 
>>org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTransla
>>torPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
>>         at
>> 
>>org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Client
>>NamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>         at
>> 
>>org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Pr
>>otobufRpcEngine.java:616)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at
>> 
>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>>.java:1657)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
>> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
>> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
>>starting
>> at txid 26198626
>> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
>>(ExitUtil.java:terminate(124)) -
>> Exiting with status 1
>> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
>>(LogAdapter.java:info(47)) -
>> SHUTDOWN_MSG:
>>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>For additional commands, e-mail: user-help@hadoop.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Chris Nauroth <cn...@hortonworks.com>.
A problem I've seen a few times is that slow lookups of the hdfs user's
groups at the JournalNode introduce delays in handling the edit logging
RPC, which then times out at the NameNode side, ultimately causing an
abort and an HA failover.  If your environment is experiencing this, then
you'll see messages in the JournalNode logs about "Potential performance
problem: getGroups".  If this is happening, then there are several
potential fixes.

1. Ultimately, root cause is a performance problem in the infrastructure's
ability to lookup the groups for a user.  This warrants investigation into
whatever that infrastructure is.  (i.e. PAM/LDAP integration with
something like ActiveDirectory is common in a lot of IT shops.)  It's
extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
Service Cache Daemon) to improve performance of group lookups and reduce
load on the infrastructure in this kind of deployment.

2. A potential workaround is to use Hadoop's static group mapping feature
to define the hdfs user's list of groups in configuration.  This way, the
group lookup of the hdfs user performed by the JournalNode never hits the
group lookup infrastructure at all.  The downside is that managing group
memberships in Hadoop configuration files is much more cumbersome than
managing it externally.  For more information, see the documentation of
the configuration property hadoop.user.group.static.mapping.overrides in
core-default.xml. [1]

3. Another potential workaround is to increase the timeouts allowed for
the JournalNode RPC calls.  I haven't had as much success with this
myself, but it's possible.  For more information on how to configure this,
see the documentation of the various dfs.qjournal.*.timeout settings in
hdfs-default.xml. [2]

--Chris Nauroth

[1] https://s.apache.org/kX8D
[2] https://s.apache.org/LzJd



On 4/28/16, 7:32 AM, "Gagan Brahmi" <ga...@gmail.com> wrote:

>Hi Shaik,
>
>The error basically indicates that namenode crashed waiting for the
>write and sync to happen on the quorum of JournalNodes. In your case
>atleast 2 journal nodes should complete the write and sync without the
>timeout period of 20 seconds which does not seems to be the case.
>
>I will advice you to verify the journal node logs and you should find
>something interesting on them. Maybe some reasons for failures to
>complete the write and sync operation on journal nodes.
>
>
>Regards,
>Gagan Brahmi
>
>On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <mu...@gmail.com> wrote:
>> Hi All,
>>
>> I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
>> Kerberos security.
>>
>> NameNode having  HA and it is crashing at least once in a day with
>>"flush
>> failed for required journal " exception. don't have any network issues
>> between the nodes.
>>
>> I have tried to find the causing the issue,  but, i couldn't able to
>>found
>> proper resolution. Please help me to fix this issue.
>>
>> Thank you,
>> Shaik
>>
>> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for
>>a
>> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> 2016-04-28 05:05:23,483 INFO  BlockStateChange
>> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
>> neededReplications = 0, pendingReplications = 0.
>> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
>> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for
>>a
>> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
>> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
>> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed
>>for
>> required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
>> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
>> starting at txid 26198626))
>> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
>> respond.
>>         at
>> 
>>org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(
>>AsyncLoggerSet.java:137)
>>         at
>> 
>>org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(Qu
>>orumOutputStream.java:107)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>>utputStream.java:113)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogO
>>utputStream.java:107)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$
>>8.apply(JournalSet.java:533)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErr
>>ors(JournalSet.java:393)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.j
>>ava:57)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.
>>flush(JournalSet.java:529)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:6
>>47)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesy
>>stem.java:3492)
>>         at
>> 
>>org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNod
>>eRpcServer.java:787)
>>         at
>> 
>>org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTransla
>>torPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
>>         at
>> 
>>org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Client
>>NamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>         at
>> 
>>org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Pr
>>otobufRpcEngine.java:616)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at
>> 
>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>>.java:1657)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
>> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
>> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream
>>starting
>> at txid 26198626
>> 2016-04-28 05:05:25,150 INFO  util.ExitUtil
>>(ExitUtil.java:terminate(124)) -
>> Exiting with status 1
>> 2016-04-28 05:05:25,160 INFO  namenode.NameNode
>>(LogAdapter.java:info(47)) -
>> SHUTDOWN_MSG:
>>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
>For additional commands, e-mail: user-help@hadoop.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Chris Nauroth <cn...@hortonworks.com>.
A problem I've seen a few times is that slow lookups of the hdfs user's
groups at the JournalNode introduce delays in handling the edit logging
RPC, which then times out at the NameNode side, ultimately causing an
abort and an HA failover.  If your environment is experiencing this, then
you'll see messages in the JournalNode logs about "Potential performance
problem: getGroups".  If this is happening, then there are several
potential fixes.
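A quick scan of the JournalNode logs will surface that warning if it is
present. A minimal sketch (the log path and the full log-line format below
are illustrative, here written to a temporary file; point the grep at your
actual JournalNode log, e.g. under the Hadoop log directory on HDP):

```shell
# Hypothetical sample of what a slow-group-lookup warning can look like;
# a real JournalNode log lives under the Hadoop log directory.
LOG=/tmp/journalnode-sample.log
cat > "$LOG" <<'EOF'
2016-04-28 05:05:10,123 WARN  security.Groups - Potential performance problem: getGroups(user=hdfs) took 15342 milliseconds
EOF

# Any match suggests group lookups at the JournalNode are slow enough to
# delay edit-logging RPCs.
grep -c "Potential performance problem: getGroups" "$LOG"   # → 1
```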

1. Ultimately, the root cause is a performance problem in the
infrastructure's ability to look up the groups for a user.  This warrants
investigation into whatever that infrastructure is.  (PAM/LDAP integration
with something like Active Directory is common in a lot of IT shops.)  It's
extremely helpful for the nodes of a Hadoop cluster to run nscd (Name
Service Cache Daemon) to improve the performance of group lookups and
reduce load on the infrastructure in this kind of deployment.

2. A potential workaround is to use Hadoop's static group mapping feature
to define the hdfs user's list of groups in configuration.  This way, the
group lookup of the hdfs user performed by the JournalNode never hits the
group lookup infrastructure at all.  The downside is that managing group
memberships in Hadoop configuration files is much more cumbersome than
managing it externally.  For more information, see the documentation of
the configuration property hadoop.user.group.static.mapping.overrides in
core-default.xml. [1]
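As a sketch, the override for workaround #2 could look like the following
in core-site.xml (the group list here is illustrative; substitute the
groups your hdfs user actually belongs to):

```xml
<!-- core-site.xml: pin the hdfs user's groups statically so the
     JournalNode never queries the external group-lookup
     infrastructure for that user. -->
<property>
  <name>hadoop.user.group.static.mapping.overrides</name>
  <!-- Format: user1=group1,group2;user2=group1;... -->
  <value>hdfs=hdfs,hadoop;</value>
</property>
```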

3. Another potential workaround is to increase the timeouts allowed for
the JournalNode RPC calls.  I haven't had as much success with this
myself, but it's possible.  For more information on how to configure this,
see the documentation of the various dfs.qjournal.*.timeout settings in
hdfs-default.xml. [2]
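For example, the 20000 ms sendEdits timeout seen in the log above is
governed by dfs.qjournal.write-txns.timeout.ms; raising it in
hdfs-site.xml could look like this (the value shown is illustrative):

```xml
<!-- hdfs-site.xml: give JournalNode writes more time before the
     NameNode aborts the quorum output stream. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
```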

--Chris Nauroth

[1] https://s.apache.org/kX8D
[2] https://s.apache.org/LzJd







Re: NameNode Crashing with "flush failed for required journal" exception

Posted by Gagan Brahmi <ga...@gmail.com>.
Hi Shaik,

The error basically indicates that the NameNode crashed while waiting for
the write and sync to complete on a quorum of JournalNodes. In your case
at least 2 of the 3 JournalNodes must complete the write and sync within
the 20-second timeout, which does not seem to be happening.

I would advise you to check the JournalNode logs; you should find
something interesting there, perhaps the reason the write and sync
operations are failing to complete in time.


Regards,
Gagan Brahmi

On Thu, Apr 28, 2016 at 4:32 AM, Shaik M <mu...@gmail.com> wrote:
> Hi All,
>
> I am running 8 node HDP 2.3 Hadoop Cluster (3 Master+5 DataNodes) with
> Kerberos security.
>
> The NameNode is HA-enabled and it is crashing at least once a day with a
> "flush failed for required journal" exception. We don't have any network
> issues between the nodes.
>
> I have tried to find what is causing the issue, but I couldn't find a
> proper resolution. Please help me to fix this issue.
>
> Thank you,
> Shaik
>
> 2016-04-28 05:05:23,159 WARN  client.QuorumJournalManager
> (QuorumCall.java:waitFor(134)) - Waited 18015 ms (timeout=20000 ms) for a
> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
> 2016-04-28 05:05:23,483 INFO  BlockStateChange
> (BlockManager.java:computeReplicationWorkForBlocks(1522)) - BLOCK*
> neededReplications = 0, pendingReplications = 0.
> 2016-04-28 05:05:24,160 WARN  client.QuorumJournalManager
> (QuorumCall.java:waitFor(134)) - Waited 19016 ms (timeout=20000 ms) for a
> response for sendEdits. Succeeded so far: [10.192.149.194:8485]
> 2016-04-28 05:05:25,145 FATAL namenode.FSEditLog
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for
> required journal (JournalAndStream(mgr=QJM to [10.192.149.187:8485,
> 10.192.149.195:8485, 10.192.149.194:8485], stream=QuorumOutputStream
> starting at txid 26198626))
> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
> respond.
>         at
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>         at
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>         at
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>         at
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>         at
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>         at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>         at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>         at
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3492)
>         at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:787)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:536)
>         at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2016-04-28 05:05:25,147 WARN  client.QuorumJournalManager
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting
> at txid 26198626
> 2016-04-28 05:05:25,150 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) -
> Exiting with status 1
> 2016-04-28 05:05:25,160 INFO  namenode.NameNode (LogAdapter.java:info(47)) -
> SHUTDOWN_MSG:
>


