Posted to user@hadoop.apache.org by Quanlong Huang <hu...@gmail.com> on 2019/01/14 03:28:46 UTC

NameNode timeout to writing to QJM after using G1GC in a large heap

Hi all,

Our HDFS is an HA deployment that depends on QuorumJournalNodes. We previously
ran the NameNode with CMS and a 120GB heap. When we increased the heap size to
180GB and enabled G1GC on one of the NameNodes, it crashed about 10 minutes
after becoming active.

The crash is due to a FATAL error: the NameNode timed out writing to all of the
JournalNodes. However, no GC pauses were detected at that time, and resource
usage (CPU, disk/network IO) was low. For group mapping we're using
ShellBasedUnixGroupsMapping, so the bottleneck should not be in connections to
an LDAP server.
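
For reference, this is roughly what that group-mapping choice looks like through
the Hadoop Configuration API. The key hadoop.security.group.mapping and the
ShellBasedUnixGroupsMapping class are the stock Hadoop ones; treat the snippet
as an illustration only, since the real setting lives in core-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class GroupMappingSketch {
      public static void main(String[] args) {
        // Resolve user groups by shelling out (id/groups) on the NameNode host
        // instead of querying an LDAP server.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.group.mapping",
            "org.apache.hadoop.security.ShellBasedUnixGroupsMapping");
        System.out.println(conf.get("hadoop.security.group.mapping"));
      }
    }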

Here're the NameNode logs:


   1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
   2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
   16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
   17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
   18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
   19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
   20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
   21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
   22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
   23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
   24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
   25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
   26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
   27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
   28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
   29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
   30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
   31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
   33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
   34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
   35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
   36.         at java.security.AccessController.doPrivileged(Native Method)
   37.         at javax.security.auth.Subject.doAs(Subject.java:422)
   38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
   39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
   40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
   41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
   42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
   43. /************************************************************
   44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
   45. ************************************************************/
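
For context, the FATAL error above comes from the quorum wait visible in the
stack trace (AsyncLoggerSet.waitForWriteQuorum). The following is only a
simplified sketch of that pattern as I understand it, not the actual Hadoop
code: the batch of edits is sent to every JournalNode asynchronously, and the
flush fails unless a majority acks within the timeout:

    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class QuorumWaitSketch {
      // Wait until a majority of JournalNodes ack the edit batch, or the timeout
      // expires. Returning false corresponds to the "Timed out waiting 20000ms
      // for a quorum of nodes to respond" error above.
      static boolean waitForWriteQuorum(CompletionService<Void> acks,
                                        int journalNodes, long timeoutMs)
          throws InterruptedException {
        int needed = journalNodes / 2 + 1;   // e.g. 2 of 3
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        int succeeded = 0;
        int finished = 0;
        while (succeeded < needed && finished < journalNodes) {
          long remaining = deadline - System.nanoTime();
          if (remaining <= 0) {
            break;                           // the quorum did not arrive in time
          }
          Future<Void> done = acks.poll(remaining, TimeUnit.NANOSECONDS);
          if (done == null) {
            break;                           // poll timed out
          }
          finished++;
          try {
            done.get();                      // this JournalNode acked
            succeeded++;
          } catch (ExecutionException e) {
            // this JournalNode failed; keep waiting for the others
          }
        }
        return succeeded >= needed;
      }
    }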

Only one JvmPauseMonitor complaint was logged, and it appeared well before the timeout:


   1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
   2. No GCs detected

On the QuorumJournalNodes, no warning logs were found in that period. It seems
the NameNode failed to connect to any of the JournalNodes.

Here're the GC options we set for G1GC:


   1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
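
As a side note, the usual spelling of that last flag is -verbose:gc, so it is
worth double-checking which options the NameNode JVM actually received. A
minimal, purely illustrative way to dump the options from inside a running JVM
(jinfo against the NameNode PID gives the same information):

    import java.lang.management.ManagementFactory;

    public class PrintJvmArgs {
      public static void main(String[] args) {
        // Print every option the current JVM was started with.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
          System.out.println(arg);
        }
      }
    }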

Could anyone share some thoughts?

Thanks,
Quanlong Huang

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
The JvmPauseMonitor warning "No GCs detected" usually suggests the host is
running out of resources. I've seen it happen on hosts that run other master
services in addition to NameNodes.
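
For anyone unfamiliar with that message, the monitor's logic is roughly the
following (a simplified sketch, not Hadoop's actual JvmPauseMonitor): sleep a
fixed interval, measure how much longer the sleep really took, and compare GC
counters before and after. A long overshoot with unchanged GC counters is
reported as "No GCs detected", i.e. the stall came from the host or a starved
JVM rather than from a collection:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class PauseMonitorSketch {
      private static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
          count += Math.max(0, gc.getCollectionCount());
        }
        return count;
      }

      public static void main(String[] args) throws InterruptedException {
        final long sleepMs = 500;            // polling interval
        final long warnThresholdMs = 1000;   // report pauses longer than this
        while (true) {
          long gcBefore = totalGcCount();
          long start = System.nanoTime();
          Thread.sleep(sleepMs);
          long extraMs = (System.nanoTime() - start) / 1_000_000L - sleepMs;
          long gcAfter = totalGcCount();
          if (extraMs > warnThresholdMs) {
            String cause = (gcAfter > gcBefore) ? "GC activity observed" : "No GCs detected";
            System.out.println("Detected pause of approximately " + extraMs
                + "ms (" + cause + ")");
          }
        }
      }
    }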

I've also seen a network disconnection (somebody restarted a network switch)
result in QJM timeout messages like these. I've also seen cases where
everything works perfectly fine, but the Linux kernel decides not to send the
response from the JN back to the NN process.

If this happens again, try bumping the QJM timeouts from 20 seconds to 40
seconds.
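
If I recall correctly, the relevant setting is dfs.qjournal.write-txns.timeout.ms
(default 20000); please verify the key name against your hdfs-default.xml before
relying on it. It normally goes into hdfs-site.xml on the NameNodes; purely as an
illustration of the change:

    import org.apache.hadoop.conf.Configuration;

    public class QjmTimeoutSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the QJM write timeout from the 20s default to 40s. In a real
        // deployment this belongs in hdfs-site.xml, not in code.
        conf.setInt("dfs.qjournal.write-txns.timeout.ms", 40000);
        System.out.println(conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000));
      }
    }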

On Sun, Jan 13, 2019 at 8:29 PM Quanlong Huang <hu...@gmail.com>
wrote:

> Hi all,
>
> Our HDFS has HA setting which depends on QuorumJournalNodes. We previously
> use CMS and 120GB heap size for the NameNode. When we increased the heap
> size to 180GB and enable G1GC for one of the NameNode, it crashed after it
> became active for ~10 minutes.
>
> The crash is due to a FATAL error that the NameNode timeout to write to
> any of the JournalNodes. However, no GC pauses were detected at that time.
> Resources (CPU, disk/network IO) usage were low. For the group mapping,
> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
> connections with LDAP server.
>
> Here're the NameNode logs:
>
>
>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>    36.         at java.security.AccessController.doPrivileged(Native Method)
>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>    43. /************************************************************
>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>    45. ************************************************************/
>
> Only one complain of JvmPauseMonitor was shown far early before the
> timeout:
>
>
>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>    2. No GCs detected
>
> In the QuaromJournalNodes, no warning logs are found in that period. It
> seems the NameNode fails to connect to any of the JournalNodes.
>
> Here're the GC options we set for G1GC:
>
>
>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>
> Could anyone give some thoughts?
>
> Thanks,
> Quanlong Huang
>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Quanlong Huang <hu...@gmail.com>.
Hi all,

Thanks for your help! We finally found that we had missed some logs after the
SHUTDOWN_MSG. While the NameNode was shutting down, the QJM writes finally
completed:

2019-01-09 18:54:42,614 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 38573ms
to send a batch of 11 edits (1261 bytes) to remote journal 10.88.135.29:8485
2019-01-09 18:54:42,639 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 38598ms
to send a batch of 11 edits (1261 bytes) to remote journal 10.88.131.30:8485
2019-01-09 18:54:42,639 WARN
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 38598ms
to send a batch of 11 edits (1261 bytes) to remote journal 10.88.133.30:8485

We decided to increase the timeout to 60s as you suggested.

About my question on NameNode scalability improvements in later CDH releases
(>5.7.3): we found that the correct Hadoop repo for CDH is
https://github.com/cloudera/hadoop-common. It contains some important
improvements such as HDFS-6763 and HDFS-8865.

Thanks,
Quanlong

On Fri, Jan 18, 2019 at 10:11 AM Quanlong Huang <hu...@gmail.com>
wrote:

> Hi friends, thanks for your reply!
>
> >Are you using JDK1.7 or 1.8? The G1GC works better on JDK1.8. From
> Cloudera's experience, we don't recommend G1GC on JDK1.7.
>
> No, as I mentioned in the previous email, we are using Oracle jdk
> 1.8.0_121. I think G1GC is ok for our jdk.
>
> >the CDH5.7.3 is quite old, and a number of NameNode scalability
> improvements were added in later releases.
>
> Could you help to pick up some important patches/JIRAs in later releases?
> I've gone through the release notes of CDH 5.8 ~ 5.16 (
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_new_features.html)
> but found nothing about scalability improvements. The hadoop repos in
> https://github.com/cloudera are not updated for a long time so we don't
> know the actual changes in later releases.
>
> Thanks for the mention of Dynamometer! It looks cool. We'll have a try on
> it!
>
> Thanks,
> Quanlong
>
> On Fri, Jan 18, 2019 at 1:14 AM Wei-Chiu Chuang <we...@cloudera.com>
> wrote:
>
>> As a developer at Cloudera, the CDH5.7.3 is quite old, and a number of
>> NameNode scalability improvements were added in later releases.
>>
>> Are you using JDK1.7 or 1.8? The G1GC works better on JDK1.8. From
>> Cloudera's experience, we don't recommend G1GC on JDK1.7.
>>
>> Re: Dynamometer question: We have an internal fork of Dynamometer for
>> CDH/CM because of the way it is designed is not very compatible with
>> Dynamometer out of box.
>> We wanted to add the metrics into CM, but CM's metrics monitoring system
>> is too bulky and hard to change.
>>
>>
>> On Wed, Jan 16, 2019 at 4:11 AM Quanlong Huang <hu...@gmail.com>
>> wrote:
>>
>>> Hi Wei-Chiu, Sajid and Erik,
>>>
>>> Great thanks for all your responses! Please forgive me to reply in one
>>> email in order to summarize our environment.
>>>
>>> We're using hadoop 2.6-cdh5.7.3 and oracle jdk 1.8.0_121 in a cluster
>>> with >300M blocks. NameNodes work normally using CMS in a 120GB heap and
>>> never timeout writing to JournalNodes. So I think current timeout setting
>>> (20s) for QJM should be ok for using G1GC in a 180GB heap. (@Wei-Chiu,
>>> @Sajid) Please correct me if it's possible to have poorer performance using
>>> G1GC in a 180GB heap than using CMS in a 120GB heap.
>>>
>>> Thanks for Erik's GC configs and we'll investigate it. However, It's a
>>> production cluster with heavy workloads. We want to be clear of the crash
>>> before trying again G1GC or any new GC configs.
>>>
>>> I have some follow-up questions:
>>>
>>> a) Are there any tools in the community used to benchmark the NameNode?
>>> So we can use them to simulate the heavy workloads and reproduce the crash.
>>>
>>> b) Is it possible that the timeout is due to RPC resource starvation?
>>> Does code paths writing to JournalNodes share anything (e.g. thread pool)
>>> with code paths processing clients/Datanodes RPC requests?
>>> The NameNode is deployed in a dedicated machine. Hardware resource
>>> usages (e.g. CPU, network/disk IO) are low in the timeout period. Only the
>>> RPC metrics looks abnormal. However, it may be the cause or the result.
>>> Here's a snapshot of the RPC metrics:
>>> https://drive.google.com/file/d/1_ZuJnURIVv-nb5B9_Lj7bcxt8oPabRn6/view?usp=sharing
>>>
>>> c) Are there any retry policies for the NameNode to write to
>>> JournalNodes?
>>> In the logs, it looks like NameNode does not retry but waiting for 20
>>> seconds.
>>>
>>> We'd be appreciated if any of these questions could have your answer!
>>>
>>> Thanks,
>>> Quanlong
>>>
>>> On Wed, Jan 16, 2019 at 12:17 AM Erik Krogen <ek...@linkedin.com>
>>> wrote:
>>>
>>>> In my personal experience with MaxGCPauseMillis, it did not have a
>>>> strong effect on the GC pause behavior, but this was on a slightly old JVM
>>>> (I believe 8u31). I found that the most important setting for large
>>>> clusters (in our case, hundreds of millions of files/blocks) was the G1RSetRegionEntries,
>>>> which was necessary to bump to about 2-3x the default to avoid lots of huge
>>>> (>20 sec) pauses during periods of high churn like startup and failover. I
>>>> still don't understand how the NameNode could have continued to produce
>>>> logs if a GC pause really was the cause of your issue, but maybe you can
>>>> try adjusting this.
>>>> ------------------------------
>>>> *From:* Quanlong Huang <hu...@gmail.com>
>>>> *Sent:* Monday, January 14, 2019 6:02:12 PM
>>>> *To:* Erik Krogen
>>>> *Cc:* user@hadoop.apache.org; xicheng.dong@hulu.com;
>>>> anning.luo@hulu.com; chang.wu@hulu.com
>>>> *Subject:* Re: NameNode timeout to writing to QJM after using G1GC in
>>>> a large heap
>>>>
>>>> Hi Erik,
>>>>
>>>> Thanks for your response! However, it's quite a pity that the GC log
>>>> was overwritten after NameNode restart... We should have set the GC log
>>>> file name to "gc-%t.log" instead of "gc.log".
>>>>
>>>> One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog
>>>> of Hortonworks, it's suggested to set it to 4000 for large clusters (>50M
>>>> files):
>>>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html
>>>>
>>>> Our cluster is much larger than the threshold (50M files) but we set
>>>> MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to
>>>> write to JournalNodes. Do you have any suggestions for this?
>>>>
>>>> Thanks,
>>>> Quanlong
>>>>
>>>> On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com>
>>>> wrote:
>>>>
>>>> Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log)
>>>> is usually a better idea than relying on the pause monitor in my experience.
>>>>
>>>>
>>>> That being said, given that the NN continued to produce log messages
>>>> during this time, it doesn't seem to be a GC pause issue.
>>>> ------------------------------
>>>> *From:* Quanlong Huang <hu...@gmail.com>
>>>> *Sent:* Sunday, January 13, 2019 7:28:46 PM
>>>> *To:* user@hadoop.apache.org
>>>> *Cc:* xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
>>>> *Subject:* NameNode timeout to writing to QJM after using G1GC in a
>>>> large heap
>>>>
>>>> Hi all,
>>>>
>>>> Our HDFS has HA setting which depends on QuorumJournalNodes. We
>>>> previously use CMS and 120GB heap size for the NameNode. When we increased
>>>> the heap size to 180GB and enable G1GC for one of the NameNode, it crashed
>>>> after it became active for ~10 minutes.
>>>>
>>>> The crash is due to a FATAL error that the NameNode timeout to write to
>>>> any of the JournalNodes. However, no GC pauses were detected at that time.
>>>> Resources (CPU, disk/network IO) usage were low. For the group mapping,
>>>> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
>>>> connections with LDAP server.
>>>>
>>>> Here're the NameNode logs:
>>>>
>>>>
>>>>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>>>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>>>>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>>>>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>>>>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>>>>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>>>>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>>>>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>>>>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>>>>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>>>>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>>>>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>>>>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>>>>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>>>>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>>>>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>>>>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>>>>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>>>>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>>>>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>>>>    36.         at java.security.AccessController.doPrivileged(Native Method)
>>>>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>>>>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>>>>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>>>>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>>>>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>>    43. /************************************************************
>>>>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>>>>    45. ************************************************************/
>>>>
>>>> Only one complain of JvmPauseMonitor was shown far early before the
>>>> timeout:
>>>>
>>>>
>>>>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>>>>    2. No GCs detected
>>>>
>>>> In the QuaromJournalNodes, no warning logs are found in that period. It
>>>> seems the NameNode fails to connect to any of the JournalNodes.
>>>>
>>>> Here're the GC options we set for G1GC:
>>>>
>>>>
>>>>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>>>>
>>>> Could anyone give some thoughts?
>>>>
>>>> Thanks,
>>>> Quanlong Huang
>>>>
>>>>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Quanlong Huang <hu...@gmail.com>.
Hi friends, thanks for your replies!

>Are you using JDK1.7 or 1.8? The G1GC works better on JDK1.8. From
Cloudera's experience, we don't recommend G1GC on JDK1.7.

As I mentioned in the previous email, we are using Oracle JDK 1.8.0_121, so I
think G1GC should be fine on our JDK.

>the CDH5.7.3 is quite old, and a number of NameNode scalability
improvements were added in later releases.

Could you help point out some important patches/JIRAs in later releases?
I've gone through the release notes of CDH 5.8 ~ 5.16 (
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_new_features.html)
but found nothing about scalability improvements. The Hadoop repos in
https://github.com/cloudera have not been updated in a long time, so we don't
know the actual changes in later releases.

Thanks for mentioning Dynamometer! It looks cool. We'll give it a try!

Thanks,
Quanlong

On Fri, Jan 18, 2019 at 1:14 AM Wei-Chiu Chuang <we...@cloudera.com>
wrote:

> As a developer at Cloudera, the CDH5.7.3 is quite old, and a number of
> NameNode scalability improvements were added in later releases.
>
> Are you using JDK1.7 or 1.8? The G1GC works better on JDK1.8. From
> Cloudera's experience, we don't recommend G1GC on JDK1.7.
>
> Re: Dynamometer question: We have an internal fork of Dynamometer for
> CDH/CM because of the way it is designed is not very compatible with
> Dynamometer out of box.
> We wanted to add the metrics into CM, but CM's metrics monitoring system
> is too bulky and hard to change.
>
>
> On Wed, Jan 16, 2019 at 4:11 AM Quanlong Huang <hu...@gmail.com>
> wrote:
>
>> Hi Wei-Chiu, Sajid and Erik,
>>
>> Great thanks for all your responses! Please forgive me to reply in one
>> email in order to summarize our environment.
>>
>> We're using hadoop 2.6-cdh5.7.3 and oracle jdk 1.8.0_121 in a cluster
>> with >300M blocks. NameNodes work normally using CMS in a 120GB heap and
>> never timeout writing to JournalNodes. So I think current timeout setting
>> (20s) for QJM should be ok for using G1GC in a 180GB heap. (@Wei-Chiu,
>> @Sajid) Please correct me if it's possible to have poorer performance using
>> G1GC in a 180GB heap than using CMS in a 120GB heap.
>>
>> Thanks for Erik's GC configs and we'll investigate it. However, It's a
>> production cluster with heavy workloads. We want to be clear of the crash
>> before trying again G1GC or any new GC configs.
>>
>> I have some follow-up questions:
>>
>> a) Are there any tools in the community used to benchmark the NameNode?
>> So we can use them to simulate the heavy workloads and reproduce the crash.
>>
>> b) Is it possible that the timeout is due to RPC resource starvation?
>> Does code paths writing to JournalNodes share anything (e.g. thread pool)
>> with code paths processing clients/Datanodes RPC requests?
>> The NameNode is deployed in a dedicated machine. Hardware resource usages
>> (e.g. CPU, network/disk IO) are low in the timeout period. Only the RPC
>> metrics looks abnormal. However, it may be the cause or the result. Here's
>> a snapshot of the RPC metrics:
>> https://drive.google.com/file/d/1_ZuJnURIVv-nb5B9_Lj7bcxt8oPabRn6/view?usp=sharing
>>
>> c) Are there any retry policies for the NameNode to write to JournalNodes?
>> In the logs, it looks like NameNode does not retry but waiting for 20
>> seconds.
>>
>> We'd be appreciated if any of these questions could have your answer!
>>
>> Thanks,
>> Quanlong
>>
>> On Wed, Jan 16, 2019 at 12:17 AM Erik Krogen <ek...@linkedin.com>
>> wrote:
>>
>>> In my personal experience with MaxGCPauseMillis, it did not have a
>>> strong effect on the GC pause behavior, but this was on a slightly old JVM
>>> (I believe 8u31). I found that the most important setting for large
>>> clusters (in our case, hundreds of millions of files/blocks) was the G1RSetRegionEntries,
>>> which was necessary to bump to about 2-3x the default to avoid lots of huge
>>> (>20 sec) pauses during periods of high churn like startup and failover. I
>>> still don't understand how the NameNode could have continued to produce
>>> logs if a GC pause really was the cause of your issue, but maybe you can
>>> try adjusting this.
>>> ------------------------------
>>> *From:* Quanlong Huang <hu...@gmail.com>
>>> *Sent:* Monday, January 14, 2019 6:02:12 PM
>>> *To:* Erik Krogen
>>> *Cc:* user@hadoop.apache.org; xicheng.dong@hulu.com; anning.luo@hulu.com;
>>> chang.wu@hulu.com
>>> *Subject:* Re: NameNode timeout to writing to QJM after using G1GC in a
>>> large heap
>>>
>>> Hi Erik,
>>>
>>> Thanks for your response! However, it's quite a pity that the GC log was
>>> overwritten after NameNode restart... We should have set the GC log file
>>> name to "gc-%t.log" instead of "gc.log".
>>>
>>> One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog
>>> of Hortonworks, it's suggested to set it to 4000 for large clusters (>50M
>>> files):
>>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html
>>>
>>> Our cluster is much larger than the threshold (50M files) but we set
>>> MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to
>>> write to JournalNodes. Do you have any suggestions for this?
>>>
>>> Thanks,
>>> Quanlong
>>>
>>> On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com>
>>> wrote:
>>>
>>> Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log)
>>> is usually a better idea than relying on the pause monitor in my experience.
>>>
>>>
>>> That being said, given that the NN continued to produce log messages
>>> during this time, it doesn't seem to be a GC pause issue.
>>> ------------------------------
>>> *From:* Quanlong Huang <hu...@gmail.com>
>>> *Sent:* Sunday, January 13, 2019 7:28:46 PM
>>> *To:* user@hadoop.apache.org
>>> *Cc:* xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
>>> *Subject:* NameNode timeout to writing to QJM after using G1GC in a
>>> large heap
>>>
>>> Hi all,
>>>
>>> Our HDFS has HA setting which depends on QuorumJournalNodes. We
>>> previously use CMS and 120GB heap size for the NameNode. When we increased
>>> the heap size to 180GB and enable G1GC for one of the NameNode, it crashed
>>> after it became active for ~10 minutes.
>>>
>>> The crash is due to a FATAL error that the NameNode timeout to write to
>>> any of the JournalNodes. However, no GC pauses were detected at that time.
>>> Resources (CPU, disk/network IO) usage were low. For the group mapping,
>>> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
>>> connections with LDAP server.
>>>
>>> Here're the NameNode logs:
>>>
>>>
>>>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>>>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>>>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>>>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>>>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>>>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>>>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>>>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>>>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>>>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>>>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>>>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>>>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>>>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>>>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>>>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>>>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>>>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>>>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>>>    36.         at java.security.AccessController.doPrivileged(Native Method)
>>>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>>>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>>>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>>>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>>>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>    43. /************************************************************
>>>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>>>    45. ************************************************************/
>>>
>>> Only one complain of JvmPauseMonitor was shown far early before the
>>> timeout:
>>>
>>>
>>>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>>>    2. No GCs detected
>>>
>>> In the QuaromJournalNodes, no warning logs are found in that period. It
>>> seems the NameNode fails to connect to any of the JournalNodes.
>>>
>>> Here're the GC options we set for G1GC:
>>>
>>>
>>>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>>>
>>> Could anyone give some thoughts?
>>>
>>> Thanks,
>>> Quanlong Huang
>>>
>>>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Wei-Chiu Chuang <we...@cloudera.com.INVALID>.
Speaking as a developer at Cloudera: CDH 5.7.3 is quite old, and a number of
NameNode scalability improvements were added in later releases.

Are you using JDK1.7 or 1.8? The G1GC works better on JDK1.8. From
Cloudera's experience, we don't recommend G1GC on JDK1.7.

Re: the Dynamometer question: we have an internal fork of Dynamometer for
CDH/CM, because the way CDH/CM is designed is not very compatible with
Dynamometer out of the box. We wanted to add the metrics into CM, but CM's
metrics monitoring system is too bulky and hard to change.


On Wed, Jan 16, 2019 at 4:11 AM Quanlong Huang <hu...@gmail.com>
wrote:

> Hi Wei-Chiu, Sajid and Erik,
>
> Great thanks for all your responses! Please forgive me to reply in one
> email in order to summarize our environment.
>
> We're using hadoop 2.6-cdh5.7.3 and oracle jdk 1.8.0_121 in a cluster with
> >300M blocks. NameNodes work normally using CMS in a 120GB heap and never
> timeout writing to JournalNodes. So I think current timeout setting (20s)
> for QJM should be ok for using G1GC in a 180GB heap. (@Wei-Chiu, @Sajid)
> Please correct me if it's possible to have poorer performance using G1GC in
> a 180GB heap than using CMS in a 120GB heap.
>
> Thanks for Erik's GC configs and we'll investigate it. However, It's a
> production cluster with heavy workloads. We want to be clear of the crash
> before trying again G1GC or any new GC configs.
>
> I have some follow-up questions:
>
> a) Are there any tools in the community used to benchmark the NameNode? So
> we can use them to simulate the heavy workloads and reproduce the crash.
>
> b) Is it possible that the timeout is due to RPC resource starvation? Does
> code paths writing to JournalNodes share anything (e.g. thread pool) with
> code paths processing clients/Datanodes RPC requests?
> The NameNode is deployed in a dedicated machine. Hardware resource usages
> (e.g. CPU, network/disk IO) are low in the timeout period. Only the RPC
> metrics looks abnormal. However, it may be the cause or the result. Here's
> a snapshot of the RPC metrics:
> https://drive.google.com/file/d/1_ZuJnURIVv-nb5B9_Lj7bcxt8oPabRn6/view?usp=sharing
>
> c) Are there any retry policies for the NameNode to write to JournalNodes?
> In the logs, it looks like NameNode does not retry but waiting for 20
> seconds.
>
> We'd be appreciated if any of these questions could have your answer!
>
> Thanks,
> Quanlong
>
> On Wed, Jan 16, 2019 at 12:17 AM Erik Krogen <ek...@linkedin.com> wrote:
>
>> In my personal experience with MaxGCPauseMillis, it did not have a strong
>> effect on the GC pause behavior, but this was on a slightly old JVM (I
>> believe 8u31). I found that the most important setting for large clusters
>> (in our case, hundreds of millions of files/blocks) was the G1RSetRegionEntries,
>> which was necessary to bump to about 2-3x the default to avoid lots of huge
>> (>20 sec) pauses during periods of high churn like startup and failover. I
>> still don't understand how the NameNode could have continued to produce
>> logs if a GC pause really was the cause of your issue, but maybe you can
>> try adjusting this.
>> ------------------------------
>> *From:* Quanlong Huang <hu...@gmail.com>
>> *Sent:* Monday, January 14, 2019 6:02:12 PM
>> *To:* Erik Krogen
>> *Cc:* user@hadoop.apache.org; xicheng.dong@hulu.com; anning.luo@hulu.com;
>> chang.wu@hulu.com
>> *Subject:* Re: NameNode timeout to writing to QJM after using G1GC in a
>> large heap
>>
>> Hi Erik,
>>
>> Thanks for your response! However, it's quite a pity that the GC log was
>> overwritten after NameNode restart... We should have set the GC log file
>> name to "gc-%t.log" instead of "gc.log".
>>
>> One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog of
>> Hortonworks, it's suggested to set it to 4000 for large clusters (>50M
>> files):
>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html
>>
>> Our cluster is much larger than the threshold (50M files) but we set
>> MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to
>> write to JournalNodes. Do you have any suggestions for this?
>>
>> Thanks,
>> Quanlong
>>
>> On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com>
>> wrote:
>>
>> Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log)
>> is usually a better idea than relying on the pause monitor in my experience.
>>
>>
>> That being said, given that the NN continued to produce log messages
>> during this time, it doesn't seem to be a GC pause issue.
>> ------------------------------
>> *From:* Quanlong Huang <hu...@gmail.com>
>> *Sent:* Sunday, January 13, 2019 7:28:46 PM
>> *To:* user@hadoop.apache.org
>> *Cc:* xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
>> *Subject:* NameNode timeout to writing to QJM after using G1GC in a
>> large heap
>>
>> Hi all,
>>
>> Our HDFS has HA setting which depends on QuorumJournalNodes. We
>> previously use CMS and 120GB heap size for the NameNode. When we increased
>> the heap size to 180GB and enable G1GC for one of the NameNode, it crashed
>> after it became active for ~10 minutes.
>>
>> The crash is due to a FATAL error that the NameNode timeout to write to
>> any of the JournalNodes. However, no GC pauses were detected at that time.
>> Resources (CPU, disk/network IO) usage were low. For the group mapping,
>> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
>> connections with LDAP server.
>>
>> Here're the NameNode logs:
>>
>>
>>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>>    36.         at java.security.AccessController.doPrivileged(Native Method)
>>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>    43. /************************************************************
>>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>>    45. ************************************************************/
>>
>> Only one complaint from JvmPauseMonitor was logged, well before the
>> timeout:
>>
>>
>>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>>    2. No GCs detected
>>
>> In the QuorumJournalNodes, no warning logs were found in that period. It
>> seems the NameNode failed to connect to any of the JournalNodes.
>>
>> Here're the GC options we set for G1GC:
>>
>>
>>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>>
>> Could anyone give some thoughts?
>>
>> Thanks,
>> Quanlong Huang
>>
>>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Quanlong Huang <hu...@gmail.com>.
Hi Wei-Chiu, Sajid and Erik,

Many thanks for all your responses! Please forgive me for replying in one
email so I can summarize our environment in one place.

We're using Hadoop 2.6-cdh5.7.3 and Oracle JDK 1.8.0_121 in a cluster with
>300M blocks. The NameNodes work normally using CMS with a 120GB heap and
never time out writing to the JournalNodes, so I think the current QJM
timeout setting (20s) should also be fine with G1GC and a 180GB heap.
(@Wei-Chiu, @Sajid) Please correct me if it's possible for G1GC with a
180GB heap to perform worse than CMS with a 120GB heap.

Thanks for Erik's GC configs; we'll investigate them. However, this is a
production cluster with heavy workloads, and we want to understand the
crash before trying G1GC or any new GC configs again.

I have some follow-up questions:

a) Are there any tools in the community for benchmarking the NameNode? We
could use them to simulate our heavy workloads and reproduce the crash.
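
(One tool we are aware of is the NNBench job shipped in the MapReduce test
jar, though we are not sure it is representative of our workload. A rough
sketch of how we'd drive it -- the jar path and flag names are from memory
and still need checking against our CDH release's usage text:)

  # Rough sketch only; verify the jar path and flags before running.
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    nnbench -operation create_write -maps 48 -reduces 6 \
    -numberOfFiles 1000 -baseDir /benchmarks/NNBench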

b) Is it possible that the timeout is due to RPC resource starvation? Do
the code paths that write to the JournalNodes share anything (e.g. a
thread pool) with the code paths that process client/DataNode RPC requests?
The NameNode runs on a dedicated machine, and hardware resource usage
(CPU, network/disk IO) was low during the timeout period. Only the RPC
metrics look abnormal, though that could be either the cause or a symptom.
Here's a snapshot of the RPC metrics:
https://drive.google.com/file/d/1_ZuJnURIVv-nb5B9_Lj7bcxt8oPabRn6/view?usp=sharing
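
(If it helps, the same counters can also be pulled straight from the
NameNode's /jmx servlet. The host and the 50070/8020 ports below are just
the stock HTTP/RPC defaults and may differ in other setups:)

  # CallQueueLength, RpcQueueTimeAvgTime and RpcProcessingTimeAvgTime live
  # in the per-port RPC bean of the client RPC server.
  curl -s 'http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020'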

c) Are there any retry policies for the NameNode when writing to the
JournalNodes? From the logs, it looks like the NameNode does not retry but
simply waits for the 20-second timeout.
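
(My guess, to be confirmed by someone who knows the QJM client better, is
that sendEdits is not retried at all: edits must be acked in order, so the
NameNode just waits out dfs.qjournal.write-txns.timeout.ms, which defaults
to 20000 ms, and then aborts the stream. A quick way to check it, plus an
illustrative way to raise it:)

  # Effective value on this NameNode (default is 20000 ms):
  hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms
  # Raising it means adding the property to hdfs-site.xml on both NameNodes
  # and restarting; 60000 below is only an example value, not a recommendation.
  #   <property>
  #     <name>dfs.qjournal.write-txns.timeout.ms</name>
  #     <value>60000</value>
  #   </property>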

We'd appreciate any answers to these questions!

Thanks,
Quanlong

On Wed, Jan 16, 2019 at 12:17 AM Erik Krogen <ek...@linkedin.com> wrote:

> In my personal experience with MaxGCPauseMillis, it did not have a strong
> effect on the GC pause behavior, but this was on a slightly old JVM (I
> believe 8u31). I found that the most important setting for large clusters
> (in our case, hundreds of millions of files/blocks) was the G1RSetRegionEntries,
> which was necessary to bump to about 2-3x the default to avoid lots of huge
> (>20 sec) pauses during periods of high churn like startup and failover. I
> still don't understand how the NameNode could have continued to produce
> logs if a GC pause really was the cause of your issue, but maybe you can
> try adjusting this.
> ------------------------------
> *From:* Quanlong Huang <hu...@gmail.com>
> *Sent:* Monday, January 14, 2019 6:02:12 PM
> *To:* Erik Krogen
> *Cc:* user@hadoop.apache.org; xicheng.dong@hulu.com; anning.luo@hulu.com;
> chang.wu@hulu.com
> *Subject:* Re: NameNode timeout to writing to QJM after using G1GC in a
> large heap
>
> Hi Erik,
>
> Thanks for your response! However, it's quite a pity that the GC log was
> overwritten after NameNode restart... We should have set the GC log file
> name to "gc-%t.log" instead of "gc.log".
>
> One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog of
> Hortonworks, it's suggested to set it to 4000 for large clusters (>50M
> files):
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html
>
> Our cluster is much larger than the threshold (50M files) but we set
> MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to
> write to JournalNodes. Do you have any suggestions for this?
>
> Thanks,
> Quanlong
>
> On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com> wrote:
>
> Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log) is
> usually a better idea than relying on the pause monitor in my experience.
>
>
> That being said, given that the NN continued to produce log messages
> during this time, it doesn't seem to be a GC pause issue.
> ------------------------------
> *From:* Quanlong Huang <hu...@gmail.com>
> *Sent:* Sunday, January 13, 2019 7:28:46 PM
> *To:* user@hadoop.apache.org
> *Cc:* xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
> *Subject:* NameNode timeout to writing to QJM after using G1GC in a large
> heap
>
> Hi all,
>
> Our HDFS has HA setting which depends on QuorumJournalNodes. We previously
> use CMS and 120GB heap size for the NameNode. When we increased the heap
> size to 180GB and enable G1GC for one of the NameNode, it crashed after it
> became active for ~10 minutes.
>
> The crash is due to a FATAL error that the NameNode timeout to write to
> any of the JournalNodes. However, no GC pauses were detected at that time.
> Resources (CPU, disk/network IO) usage were low. For the group mapping,
> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
> connections with LDAP server.
>
> Here're the NameNode logs:
>
>
>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>    36.         at java.security.AccessController.doPrivileged(Native Method)
>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>    43. /************************************************************
>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>    45. ************************************************************/
>
> Only one complaint from JvmPauseMonitor was logged, well before the
> timeout:
>
>
>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>    2. No GCs detected
>
> In the QuorumJournalNodes, no warning logs were found in that period. It
> seems the NameNode failed to connect to any of the JournalNodes.
>
> Here're the GC options we set for G1GC:
>
>
>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>
> Could anyone give some thoughts?
>
> Thanks,
> Quanlong Huang
>
>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Erik Krogen <ek...@linkedin.com>.
In my personal experience with MaxGCPauseMillis, it did not have a strong effect on the GC pause behavior, but this was on a slightly old JVM (I believe 8u31). I found that the most important setting for large clusters (in our case, hundreds of millions of files/blocks) was the G1RSetRegionEntries, which was necessary to bump to about 2-3x the default to avoid lots of huge (>20 sec) pauses during periods of high churn like startup and failover. I still don't understand how the NameNode could have continued to produce logs if a GC pause really was the cause of your issue, but maybe you can try adjusting this.
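
For reference, a rough sketch of how one could inspect and override it;
exact defaults vary by JDK build, and the 4096 below is only a placeholder,
not a recommendation:

  # Print what the running JDK reports for the flag (a value of 0 usually
  # means the JVM sizes the remembered sets ergonomically):
  java -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep G1RSetRegionEntries
  # Then add an explicit value to the NameNode JVM options, e.g.
  #   -XX:G1RSetRegionEntries=4096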

________________________________
From: Quanlong Huang <hu...@gmail.com>
Sent: Monday, January 14, 2019 6:02:12 PM
To: Erik Krogen
Cc: user@hadoop.apache.org; xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
Subject: Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Hi Erik,

Thanks for your response! However, it's quite a pity that the GC log was overwritten after NameNode restart... We should have set the GC log file name to "gc-%t.log" instead of "gc.log".

One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog of Hortonworks, it's suggested to set it to 4000 for large clusters (>50M files): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html

Our cluster is much larger than the threshold (50M files) but we set MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to write to JournalNodes. Do you have any suggestions for this?

Thanks,
Quanlong

On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com> wrote:

Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log) is usually a better idea than relying on the pause monitor in my experience.


That being said, given that the NN continued to produce log messages during this time, it doesn't seem to be a GC pause issue.

________________________________
From: Quanlong Huang <hu...@gmail.com>
Sent: Sunday, January 13, 2019 7:28:46 PM
To: user@hadoop.apache.org
Cc: xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
Subject: NameNode timeout to writing to QJM after using G1GC in a large heap

Hi all,

Our HDFS has HA setting which depends on QuorumJournalNodes. We previously use CMS and 120GB heap size for the NameNode. When we increased the heap size to 180GB and enable G1GC for one of the NameNode, it crashed after it became active for ~10 minutes.

The crash is due to a FATAL error that the NameNode timeout to write to any of the JournalNodes. However, no GC pauses were detected at that time. Resources (CPU, disk/network IO) usage were low. For the group mapping, we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in connections with LDAP server.

Here're the NameNode logs:

  1.  2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
  2.  2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  3.  2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  4.  2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  5.  2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  6.  2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  7.  2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  8.  2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  9.  2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
  17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
  18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
  19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
  20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
  21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
  22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
  23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
  24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
  25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
  26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
  27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
  28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
  29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
  30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
  31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
  33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
  34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
  35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
  36.         at java.security.AccessController.doPrivileged(Native Method)
  37.         at javax.security.auth.Subject.doAs(Subject.java:422)
  38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
  39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
  40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
  41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
  42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
  43. /************************************************************
  44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
  45. ************************************************************/

Only one complaint from JvmPauseMonitor was logged, well before the timeout:

  1.  2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
  2.  No GCs detected

In the QuorumJournalNodes, no warning logs were found in that period. It seems the NameNode failed to connect to any of the JournalNodes.

Here're the GC options we set for G1GC:

  1.  -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc

Could anyone give some thoughts?

Thanks,
Quanlong Huang

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Quanlong Huang <hu...@gmail.com>.
Hi Erik,

Thanks for your response! However, it's quite a pity that the GC log was
overwritten after NameNode restart... We should have set the GC log file
name to "gc-%t.log" instead of "gc.log".

One of my suspects is the GC option "MaxGCPauseMillis". In a tech blog of
Hortonworks, it's suggested to set it to 4000 for large clusters (>50M
files):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html

Our cluster is much larger than the threshold (50M files) but we set
MaxGCPauseMillis to 400. I'm not sure how it can cause NameNode timeout to
write to JournalNodes. Do you have any suggestions for this?

Thanks,
Quanlong

On Mon, Jan 14, 2019 at 10:53 PM Erik Krogen <ek...@linkedin.com> wrote:

> Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log) is
> usually a better idea than relying on the pause monitor in my experience.
>
>
> That being said, given that the NN continued to produce log messages
> during this time, it doesn't seem to be a GC pause issue.
> ------------------------------
> *From:* Quanlong Huang <hu...@gmail.com>
> *Sent:* Sunday, January 13, 2019 7:28:46 PM
> *To:* user@hadoop.apache.org
> *Cc:* xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
> *Subject:* NameNode timeout to writing to QJM after using G1GC in a large
> heap
>
> Hi all,
>
> Our HDFS has HA setting which depends on QuorumJournalNodes. We previously
> use CMS and 120GB heap size for the NameNode. When we increased the heap
> size to 180GB and enable G1GC for one of the NameNode, it crashed after it
> became active for ~10 minutes.
>
> The crash is due to a FATAL error that the NameNode timeout to write to
> any of the JournalNodes. However, no GC pauses were detected at that time.
> Resources (CPU, disk/network IO) usage were low. For the group mapping,
> we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in
> connections with LDAP server.
>
> Here're the NameNode logs:
>
>
>    1. 2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>    2. 2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    3. 2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    4. 2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    5. 2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    6. 2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    7. 2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    8. 2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    9. 2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
>    16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
>    17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
>    18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
>    19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
>    20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
>    21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
>    22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
>    23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
>    24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
>    25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
>    26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
>    27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
>    28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
>    29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
>    30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
>    31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>    32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>    33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>    34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>    35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>    36.         at java.security.AccessController.doPrivileged(Native Method)
>    37.         at javax.security.auth.Subject.doAs(Subject.java:422)
>    38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
>    39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>    40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
>    41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
>    42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>    43. /************************************************************
>    44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
>    45. ************************************************************/
>
> Only one complaint from JvmPauseMonitor was logged, well before the
> timeout:
>
>
>    1. 2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
>    2. No GCs detected
>
> In the QuorumJournalNodes, no warning logs were found in that period. It
> seems the NameNode failed to connect to any of the JournalNodes.
>
> Here're the GC options we set for G1GC:
>
>
>    1. -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc
>
> Could anyone give some thoughts?
>
> Thanks,
> Quanlong Huang
>

Re: NameNode timeout to writing to QJM after using G1GC in a large heap

Posted by Erik Krogen <ek...@linkedin.com>.
Checking your GC log file (looks like it's /var/log/hadoop-hdfs/gc.log) is usually a better idea than relying on the pause monitor in my experience.


That being said, given that the NN continued to produce log messages during this time, it doesn't seem to be a GC pause issue.
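
If the rotated log files are still around, a rough way to surface the
longest pauses they recorded (a sketch; adjust the path to your files):

  # Pull every "<seconds> secs]" figure out of the G1 logs and show the largest.
  grep -hEo '[0-9]+\.[0-9]+ secs]' /var/log/hadoop-hdfs/gc.log* | sort -n | tail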

________________________________
From: Quanlong Huang <hu...@gmail.com>
Sent: Sunday, January 13, 2019 7:28:46 PM
To: user@hadoop.apache.org
Cc: xicheng.dong@hulu.com; anning.luo@hulu.com; chang.wu@hulu.com
Subject: NameNode timeout to writing to QJM after using G1GC in a large heap

Hi all,

Our HDFS has HA setting which depends on QuorumJournalNodes. We previously use CMS and 120GB heap size for the NameNode. When we increased the heap size to 180GB and enable G1GC for one of the NameNode, it crashed after it became active for ~10 minutes.

The crash is due to a FATAL error that the NameNode timeout to write to any of the JournalNodes. However, no GC pauses were detected at that time. Resources (CPU, disk/network IO) usage were low. For the group mapping, we're using ShellBasedUnixGroupsMapping so the bottleneck may not be in connections with LDAP server.

Here're the NameNode logs:

  1.  2019-01-09 18:54:07,788 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
  2.  2019-01-09 18:54:10,042 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  3.  2019-01-09 18:54:11,043 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7003 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  4.  2019-01-09 18:54:12,045 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8004 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  5.  2019-01-09 18:54:13,046 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  6.  2019-01-09 18:54:14,047 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10007 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  7.  2019-01-09 18:54:15,049 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11008 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  8.  2019-01-09 18:54:16,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12009 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  9.  2019-01-09 18:54:17,050 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13010 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  10. 2019-01-09 18:54:18,052 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14011 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  11. 2019-01-09 18:54:19,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15012 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  12. 2019-01-09 18:54:20,053 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  13. 2019-01-09 18:54:21,055 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  14. 2019-01-09 18:54:22,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  15. 2019-01-09 18:54:23,056 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
  16. 2019-01-09 18:54:24,041 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.88.131.30:8485, 10.88.133.30:8485, 10.88.135.29:8485], stream=QuorumOutputStream starting at txid 46729238851))
  17. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
  18.         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
  19.         at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
  20.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
  21.         at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
  22.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
  23.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
  24.         at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
  25.         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
  26.         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
  27.         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3363)
  28.         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
  29.         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
  30.         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
  31.         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  32.         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
  33.         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
  34.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
  35.         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
  36.         at java.security.AccessController.doPrivileged(Native Method)
  37.         at javax.security.auth.Subject.doAs(Subject.java:422)
  38.         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
  39.         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
  40. 2019-01-09 18:54:24,042 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 46729238851
  41. 2019-01-09 18:54:24,045 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
  42. 2019-01-09 18:54:42,494 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
  43. /************************************************************
  44. SHUTDOWN_MSG: Shutting down NameNode at metrics-hadoop-warehousestore-nn01.server.hulu.com/10.88.135.13
  45. ************************************************************/

Only one complaint from JvmPauseMonitor was logged, well before the timeout:

  1.  2019-01-09 18:50:44,772 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1997ms
  2.  No GCs detected

In the QuorumJournalNodes, no warning logs were found in that period. It seems the NameNode failed to connect to any of the JournalNodes.

Here're the GC options we set for G1GC:

  1.  -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=32 -XX:ConcGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hadoop-hdfs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=5M -verbose.gc

Could anyone give some thoughts?

Thanks,
Quanlong Huang