Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/11/08 23:30:23 UTC

[jira] Created: (HBASE-3206) Detect slow GC loops of death

Detect slow GC loops of death
-----------------------------

                 Key: HBASE-3206
                 URL: https://issues.apache.org/jira/browse/HBASE-3206
             Project: HBase
          Issue Type: Improvement
            Reporter: Jean-Daniel Cryans
             Fix For: 0.92.0


Something that has been bothering me for a while is figuring out when a region server is slow because of frequent, small GC pauses. I usually go into that RS's GC output, watch it for a while, then decide whether it's under some kind of memory pressure. Here's an example (grepped "Full" from the GC log):

{noformat}
12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs] 
12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs] 
12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30 secs] 
12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38 secs] 
12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35 secs] 
12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01 secs] 
12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20 secs] 
12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44 secs] 
12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72 secs] 
12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06 secs]
{noformat}

In this case, that RS's TT was down so it was getting all the non-local maps at the end of the job at the same time... generating >1000% CPU usage. With scanner caching set to 10k, it's easy to understand why there's memory pressure, since we have all those objects in flight that we don't account for.
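
For context, here's roughly what the client side looks like with scanner caching set to 10k (0.90-era client API; the table name is made up for illustration):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BigScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table name
    Scan scan = new Scan();
    // 10k rows are buffered per next() RPC; every one of them lives on the
    // region server heap while the response is being built.
    scan.setCaching(10000);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process the row
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
{noformat}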

One solution I was thinking of is a sleeper thread that sleeps for 1 sec in a loop and logs whenever it notices it slept noticeably longer than 1 sec. Then, say, the region server records that it saw a few of those within x minutes and decides to somehow throttle the traffic.
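
A rough sketch of what that sleeper thread could look like (not actual HBase code; the class name and thresholds are invented for illustration):

{noformat}
// The thread asks to sleep for one second and measures how long the sleep
// really took; any large overshoot means the JVM (most likely the GC)
// stalled the whole process.
public class SleeperPauseDetector extends Thread {
  private static final long SLEEP_MS = 1000; // nominal sleep interval
  private static final long WARN_MS = 500;   // overshoot considered a pause

  @Override
  public void run() {
    while (!isInterrupted()) {
      long start = System.currentTimeMillis();
      try {
        Thread.sleep(SLEEP_MS);
      } catch (InterruptedException e) {
        return; // shutting down
      }
      long overshoot = System.currentTimeMillis() - start - SLEEP_MS;
      if (overshoot > WARN_MS) {
        // In a real region server this would increment a counter that, once
        // it trips a few times within x minutes, triggers some throttling.
        System.out.println("Slept " + overshoot + "ms longer than asked; GC pressure?");
      }
    }
  }
}
{noformat}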

What I often see is that if this situation goes unnoticed, we end up GCing more and more, and in some cases I've seen a region server go almost zombie for 2 hours before finally getting its lease expired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-3206) Detect slow GC loops of death

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929831#action_12929831 ] 

Jean-Daniel Cryans commented on HBASE-3206:
-------------------------------------------

Things we could also do when under pressure:

 - jettison the block cache
 - switch new incoming calls that request block caching over to non-caching reads
 - force flush big memstores
 - close some handlers (I wonder if it's even possible)
 - forcibly lower the scanner caching of incoming next() invocations
 - kill in-flight requests (evil!), preferably the longest-running ones or those consuming the most resources

I guess it's going to take a lot of careful crafting since doing any of these could make things even worse.
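
Purely as a sketch, here's one way the reactions above could be staged once the pause detector trips a few times; none of these method names are real HBase APIs, they're just stubs standing in for the listed actions:

{noformat}
public class PressureResponder {

  public enum PressureLevel { NONE, MODERATE, SEVERE }

  public void react(PressureLevel level) {
    switch (level) {
      case MODERATE:
        // Start with the cheapest, most reversible reactions.
        dropBlockCache();              // "jettison the block cache"
        disableCachingForNewReads();   // serve new reads without caching blocks
        break;
      case SEVERE:
        forceFlushLargestMemstores();  // free heap held by the biggest memstores
        capScannerCaching(100);        // override huge next() batch sizes
        break;
      default:
        break; // NONE: nothing to do
    }
  }

  // Stubs for illustration only.
  private void dropBlockCache() {}
  private void disableCachingForNewReads() {}
  private void forceFlushLargestMemstores() {}
  private void capScannerCaching(int max) {}
}
{noformat}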

