Posted to user@hbase.apache.org by Derek Wollenstein <de...@klout.com> on 2011/12/07 02:10:41 UTC

Help Tuning HBase GC Settings

I've been working on improving GC settings for HBase RegionServer
instances, and I seem to have run into a bit of a dead end.

Basically I've been trying to tune GC settings and memory settings
appropriately, but I keep on having my server reach something like GC
Death.

My current memory settings are
HEAPSIZE=24576
-ea -Xmx24576m -Xms24576m -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops
-XX:NewSize=192m -XX:MaxNewSize=192m
-XX:CMSInitiatingOccupancyFraction=60 -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log
-XX:ParallelGCThreads=6
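
For reference, these end up in conf/hbase-env.sh roughly like the sketch below
(paraphrased rather than pasted verbatim, so treat the exact layout as an
approximation; HBASE_HEAPSIZE is in MB and becomes -Xmx, the rest ride along
in HBASE_OPTS):

# sketch of conf/hbase-env.sh carrying the options listed above
export HBASE_HEAPSIZE=24576
export HBASE_OPTS="-ea -Xms24576m -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops \
  -XX:NewSize=192m -XX:MaxNewSize=192m \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:$HBASE_HOME/logs/gc-hbase.log \
  -XX:ParallelGCThreads=6"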

We've also set hfile.block.cache.size to 0.5 (believing that increasing the
space for cache should improve performance; I'm willing to accept that
this could be causing problems, I just haven't seen it reported).
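
In hbase-site.xml that's roughly the following (a trimmed sketch of the real
file; 0.5 of the 24GB heap works out to roughly 12GB reserved for the block
cache):

<property>
  <name>hfile.block.cache.size</name>
  <value>0.5</value>
</property>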

Basically as the service is "warming up", we see reasonable GC Log timings.

...
2011-12-06T10:19:14.834+0000: 782.473: [GC [1 CMS-initial-mark: 2225311K(3705244K)] 2237042K(3882204K), 0.0028630 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-12-06T10:19:14.837+0000: 782.476: [CMS-concurrent-mark-start]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-mark: 0.686/0.686 secs] [Times: user=1.98 sys=0.05, real=0.69 secs]
2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-preclean-start]
2011-12-06T10:19:15.533+0000: 783.172: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2011-12-06T10:19:15.534+0000: 783.172: [CMS-concurrent-abortable-preclean-start]
2011-12-06T10:19:15.742+0000: 783.381: [GC 783.381: [ParNew: 168978K->9067K(176960K), 0.0076540 secs] 2394290K->2235970K(3882204K), 0.0077920 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
2011-12-06T10:19:16.351+0000: 783.990: [CMS-concurrent-abortable-preclean: 0.123/0.817 secs] [Times: user=0.87 sys=0.06, real=0.81 secs]
2011-12-06T10:19:16.352+0000: 783.990: [GC[YG occupancy: 97548 K (176960 K)]783.990: [Rescan (parallel) , 0.0153180 secs]784.006: [weak refs processing, 0.0000170 secs] [1 CMS-remark: 2226903K(3705244K)] 2324451K(3882204K), 0.0154470 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
...
This seems to continue for about 16 hours.
But as we hit some critical time, we end up stuck running multiple young
GCs/second.  Note that they all appear to be successful (if I'm reading the
logs correctly):
2011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs]
2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(19120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs]
2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
...
Note that we are doing about 2 YGCs/second (about 80ms of pause time per
second). I believe this situation represents us using up an undesirable
amount of memory resources, but I can't find any particular reason why we go
from using a reasonable amount of GC to running GC constantly.  One theory is
that this is the result of something in the young generation being promoted,
but I'm not really sure what would be causing this pathological behavior.
Any ideas or alternative places to look would be greatly appreciated.

--Derek

Re: Help Tuning HBase GC Settings

Posted by Stack <st...@duboce.net>.
On Tue, Dec 6, 2011 at 5:10 PM, Derek Wollenstein <de...@klout.com> wrote:
> -XX:NewSize=192m -XX:MaxNewSize=192m

What if you tried a bigger newsize?  Maybe the YGC's would run longer
but they'd happen less often?
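
Something like the below, say (the number is just an illustration, not a
recommendation tuned to your workload):

-XX:NewSize=512m -XX:MaxNewSize=512m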

It's only YGC pauses that are giving you issues?


> We've also set hfile.block.cache.size to 0.5 ( believing that incresing the
> space for cache should improve performance;  I'm willing to accept that
> this could be causing problems, I just haven't seen this reported)
>

This should be fine. If you are doing mostly reads or trying to make
reads come out of cache, upping cache makes sense.


> This seems to continue for about for about 16 hours

Is it for sure duration and not a change in the loading?

If you did a rolling restart at this time -- it'd mess up latencies
for a short time -- and the server were brought back online w/ the same
regions (see the graceful_stop.sh script in the HBase book,
http://hbase.apache.org/book.html), would GC'ing have the same profile?


> But as we hit some critical time, we end up stuck running multiple young
> GC/second.  Note that they at appear to be successful (if I'm reading the
> logs correctly)
> 011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew:
> 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K),
> 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:42.522+0000: 49410.161:
> [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04
> sys=0.02, real=0.58 secs]
> 2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960
> K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs
> processing, 0.0000890 secs] [1 CMS-remark: 11478698K(1
> 9120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17
> sys=0.01, real=0.03 secs]
> 2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start]
> 2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew:
> 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K),
> 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew:
> 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K),
> 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625
> secs] [Times: user=1.73 sys=0.02, real=0.62 secs]
> 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start]
> 2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051
> secs] [Times: user=0.10 sys=0.00, real=0.05 secs]
> 2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew:
> 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K),
> 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew:
> 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K),
> 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew:
> 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K),
> 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]
> 2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew:
> 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K),
> 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> 2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew:
> 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K),
> 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs]
> ...

Above is hard to read all wrapped in mail.  Put up more of the log in
a pastebin?


> Note that we are doing about 2 YGC/second (about 80ms). I believe that this
> situation represents us using up some undesirable amount of memory
> resources, but I can't find any particular reason why we're going from
> using a reasonable amount of GC to using GC constantly.   One theory is
> that this is the result of something in the young generation being
> promoted?  But I'm not really sure what would be causing
> this pathological behavior.  Any ideas are alternative places to look would
> be greatly appreciated.
>


Do all latencies go up, or just the 95th/99th percentiles?

St.Ack

Re: Help Tuning HBase GC Settings

Posted by Matthias Hofschen <ho...@gmail.com>.
Sorry for coming so late to the thread.
Isn't it also an issue to set -XX:CMSInitiatingOccupancyFraction=60 with a
block cache of 0.5?
I was under the impression that this means a CMS cycle is triggered at 60%
occupancy, so as soon as the cache fills up it should always be triggering GC.
We are using a lower block cache of 0.35, a much higher new size of 1 gig, and
OccupancyFraction=80.
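
Spelled out (approximately -- quoting our setup from memory rather than from
the actual files), that's:

-Xmn1g -XX:CMSInitiatingOccupancyFraction=80

plus hfile.block.cache.size set to 0.35 in hbase-site.xml.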

Matthias


Re: Help Tuning HBase GC Settings

Posted by Derek Wollenstein <de...@klout.com>.
St.Ack --
    We are definitely running with a high handler count -- I've pastebin'd
my config to http://pastebin.com/i6PXixYF.  The boxes are seeing moderate,
but not especially high traffic (topping out around 1000qps).  These
problems do not seem to occur at the same time as request peaks, nor at the
same time as compactions (or at least not regularly;  They do occur during
compactions, but also when no compactions are running)

--Derek


Re: Help Tuning HBase GC Settings

Posted by Stack <st...@duboce.net>.
On Tue, Dec 6, 2011 at 6:30 PM, Derek Wollenstein <de...@klout.com> wrote:
> Jacques --
>    The main problem we really see is that after "a while"(yes, this is
> unspecific) we start seeing "random" (e.g. uncorrelated with the key being
> retrieved, but correlated with time) timeouts (>500ms) retrieving small (<
> 200 byte) rows from hbase.

What percentage of your reads do this?  All reads around the time of a
YGC flurry?

(If you let the server run, does it ever move on past the many YGCs, or does
it stay there once it hits the condition?)

How many handlers are you running with?  Can you pastebin your configuration
too?  What's the traffic like on these boxes?  How many
requests/second?  Is this happening during a traffic peak?
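
For reference, the handler count is hbase.regionserver.handler.count in
hbase-site.xml; something like the below, where 100 is just a placeholder
value, not a suggestion:

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>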

St.Ack

Re: Help Tuning HBase GC Settings

Posted by Jacques <wh...@gmail.com>.
I took a look at our logs and I don't see that kind of vicious cycle of GCs
anywhere.  The only thing I noted is that we have a very high
new ratio of 16 (I believe the default is 2 for AMD64... and I can't
remember now why we have it set that way).  For reference, the
current settings that work for us with a 12gb heap:

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:MaxGCPauseMillis=100
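
(With -XX:NewRatio=16, a 12gb heap works out to a young generation of roughly
12288m / (16 + 1) ~= 720m -- rough numbers, but useful for comparison with your
fixed 192m NewSize.)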

I'm not sure what more I can suggest.  I wish I remembered more about this
than I do.  The Sun/Oracle guys have put out a couple of presentations at
past JavaOnes that do a lot to describe GC tuning and particular symptoms.
I remember one recently by Charlie Hunt that was super useful.  It looked
something like this:
http://www.oracle.com/technetwork/server-storage/ts-4887-159080.pdf

Note, we actually don't monitor request latency because the vast majority
of our work is batch based (and thus we're focused on throughput).  We also
use very little block cache.  We also have way too many regions (based on
benchmarks provided by others), averaging about 700/server.  And we have
much larger object sizes, if yours average only 200 bytes...

Jacques




Re: Help Tuning HBase GC Settings

Posted by Derek Wollenstein <de...@klout.com>.
Jacques --
    The main problem we really see is that after "a while" (yes, this is
unspecific) we start seeing "random" (i.e. uncorrelated with the key being
retrieved, but correlated with time) timeouts (>500ms) retrieving small (<
200 byte) rows from hbase.  The interesting thing is that even a test case
that seems like it should be very cacheable (retrieving literally the same
row, over and over again) shows the same poor responsiveness.  I was just
suspecting that if we're running 6 GCs/second (which is what this hits at
"peak GC"), each at 25-35ms, that could certainly explain a decent
amount of unavailability.
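
(Rough math: 6 collections/second at ~30ms apiece is on the order of 180ms of
stop-the-world time per second, i.e. something like 15-20% of wall clock spent
paused.)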

1) We don't yet have MSLAB enabled, primarily because we're trying to turn
on settings changes one at a time.  I also wasn't clear if these symptoms
were consistent with the ones mslab was trying to fix, although I have
considered it.
2) I don't think a four minute pause is good, but I don't think a 120
second pause is that great either, so it's something we'd have to worry
about.
3) Yes, we have 8 physical cores (16 HT cores), so that shouldn't be the
problem.
4) Off heap block cache does seem like a cool feature, I will keep it in
mind.

The main reason I was asking is that we really did see this sudden "jump"
in GC activity over time, and I was hoping it indicated something like
setting the block cache setting too high relative to the heap size...



Re: Help Tuning HBase GC Settings

Posted by Jacques <wh...@gmail.com>.
I'll start with clearly stating that I'm not a gc specialist.  I spend a
bunch of time with it but forget all the things I learn once I solve my
problems...

What exactly is the problem here?  Does the server become unresponsive
after 16 hours?  What happens in the HBase logs for that regionserver?  I
believe that you're seeing frequent runs likely because of fragmentation of
your heap along with your -XX:CMSInitiatingOccupancyFraction of 60%.  These
would be a precursor to a full GC, which would likely actually take the
server down.

A few quick thoughts that you may or may not have run across:

   - MSLAB is your friend if you haven't been using it already. See more
   here:
   http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/
   - I can't remember exactly but I feel like the number that used to be
   quoted by some was 10 seconds per gb for a full gc.  So you're looking at a
   full gc of over ~4 minutes with that size heap once you do arrive at a full
   gc.
   - If you're okay having unresponsive regions for 4+minutes, you'd also
   want to increase your ZooKeeper timeout to allow for it.
   - If I remember correctly, at a recent presentation yfrog was utilizing
   heaps as high as 64gb but that most people thought that was very risky and
   you should run much lower.  The 16gb that Doug quotes is more what people
   seemed to use.
   - I haven't heard of many people setting GC threads explicitly.
    Since you set the gc threads to 6, I assume you have at least 6 true cores?

We used to run our regionservers up around 24gb but had constant problems.
 Ultimately, we settled down at 12gb heaps with mslab enabled (and at 4mb
as opposed to the traditional 2mb default due to our cell sizes).
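
In hbase-site.xml terms that's roughly the below, if I'm remembering the
property names right (the chunk size is in bytes, so 4mb = 4194304):

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>4194304</value>
</property>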

Also, off heap block cache is coming up in the 0.92 release (
https://issues.apache.org/jira/browse/HBASE-4027).  That should
theoretically allow you to use a bunch more memory for the block cache
without the same problems.  Others who are more familiar with the feature
would be able to speak better to real world results...


Jacques





Re: Help Tuning HBase GC Settings

Posted by Derek Wollenstein <de...@klout.com>.
I will take a look at lowering this; unfortunately I'm inheriting existing
settings and trying to be as conservative as possible when making changes.
I can definitely try lowering the memory -- I've gotten mixed messages on
how much to allocate to the HBase heap.  I'll start taking a look at moving
both of these settings down and seeing how it affects performance (and trying
to use https://github.com/brianfrankcooper/YCSB/wiki for testing).  Thanks
for the suggestion.


Re: Help Tuning HBase GC Settings

Posted by Doug Meil <do...@explorysmedical.com>.
Yep.





On 12/7/11 9:21 AM, "Daniel Iancu" <da...@1and1.ro> wrote:

>Will you add this to Hbase manual ? Does it worth trying 24, 32 Gb per
>RS under heavy load?
>
>On 12/07/2011 03:20 AM, Doug Meil wrote:
>> Folks at recent Hbase
>> hackathons were talking about not going higher than 16Gb for RS.
>



Re: Help Tuning HBase GC Settings

Posted by Daniel Iancu <da...@1and1.ro>.
Will you add this to Hbase manual ? Does it worth trying 24, 32 Gb per 
RS under heavy load?

On 12/07/2011 03:20 AM, Doug Meil wrote:
> Folks at recent Hbase
> hackathons were talking about not going higher than 16Gb for RS.


Re: Help Tuning HBase GC Settings

Posted by Doug Meil <do...@explorysmedical.com>.
There are others who know these GC issues better than I do, but setting
hfile.block.cache.size to .5 seems a bit aggressive to me.  That's 50% of
the heap right there.

Also, the issue with setting the max heap to 24Gb is that whenever a full
GC is required on a heap that size, it's a killer.  Folks at recent Hbase
hackathons were talking about not going higher than 16Gb for RS.




