Posted to dev@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/08/12 21:13:12 UTC

[jira] [Resolved] (HBASE-3206) Detect slow GC loops of death

     [ https://issues.apache.org/jira/browse/HBASE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-3206.
-----------------------------------

    Resolution: Not a Problem

These days we have o.a.h.h.util.JvmPauseMonitor.

> Detect slow GC loops of death
> -----------------------------
>
>                 Key: HBASE-3206
>                 URL: https://issues.apache.org/jira/browse/HBASE-3206
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Jean-Daniel Cryans
>
> Something that has been bothering me for a while is figuring out when a region server is slow because of frequent, small GC pauses. I usually go into that RS's GC output, watch it for a while, then decide whether it's under some kind of memory pressure. Here's an example (grepped "Full" from the GC log):
> {noformat}
> 12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs] 
> 12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs] 
> 12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30 secs] 
> 12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38 secs] 
> 12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35 secs] 
> 12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01 secs] 
> 12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20 secs] 
> 12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44 secs] 
> 12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72 secs] 
> 12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06 secs]
> {noformat}
> In this case, that RS's TaskTracker was down, so it was getting all the non-local map tasks at the end of the job at the same time, generating >1000% CPU usage. With scanner caching set to 10k, it's easy to understand that there's memory pressure, since all those in-flight objects aren't accounted for.
> One solution I was thinking of is a sleeper thread that sleeps for 1 second in a loop and reports whenever it notices it slept noticeably longer than that. The region server could then record how many of those events it saw within x minutes and decide to somehow throttle the traffic (a rough sketch of this idea appears after this quoted description).
> What I often saw is that if this situation goes unnoticed, we end up GCing more and more; in some cases I've seen a region server go almost zombie for 2 hours before finally having its lease expire.
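
For reference, here is a minimal sketch of the sleeper-thread approach described above, which is, broadly speaking, what o.a.h.h.util.JvmPauseMonitor does today: sleep for a fixed interval and compare the wall-clock time actually elapsed against it. The class name, threshold, and logging below are illustrative assumptions, not the actual JvmPauseMonitor code.

{code:java}
// A minimal sketch of the sleeper-thread idea, assuming a plain standalone
// class; the name, threshold, and logging are illustrative and this is not
// the actual o.a.h.h.util.JvmPauseMonitor implementation.
public class PauseDetector implements Runnable {

    private static final long SLEEP_MS = 1000;          // nominal sleep interval
    private static final long WARN_THRESHOLD_MS = 500;  // extra delay worth reporting

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            long extraMs = elapsedMs - SLEEP_MS;
            if (extraMs > WARN_THRESHOLD_MS) {
                // A real monitor would feed this into metrics or a throttling
                // decision instead of just printing it.
                System.err.println("Detected a pause of approximately " + extraMs
                    + "ms (expected " + SLEEP_MS + "ms of sleep)");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(new PauseDetector(), "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(60_000); // keep the demo alive for a minute
    }
}
{code}

Measuring with System.nanoTime() rather than trusting the sleep call keeps the detector simple: any stop-the-world pause, GC or otherwise, shows up as extra elapsed time regardless of why the thread was delayed.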



--
This message was sent by Atlassian JIRA
(v6.2#6252)