Posted to user@hbase.apache.org by Graham Baecher <gb...@hubspot.com> on 2016/05/10 18:23:51 UTC

Re: Slow sync cost

As Bryan mentioned a couple weeks ago, we've been working on an HBase G1GC
tuning blog post. It went up today:
http://product.hubspot.com/blog/g1gc-tuning-your-hbase-cluster

It's a long post, detailing all the stages we went through while
experimenting and tuning. If you're looking for a summary of the
recommendations and tuning steps, you can find it at the end of the
post (linked from the end of the introduction).

On Thu, Apr 28, 2016 at 12:47 AM, Kevin Bowling <ke...@kev009.com>
wrote:

> Even G1GC will have 100ms pause times, which would trigger this warning.
> Are there any real production clusters that don't constantly trigger this
> warning?  What was the thought process behind 100ms?  When a write goes
> through multiple JVMs, any of which could be doing GC, over a network,
> 100ms is not a long time!  Spinning disks take tens of ms even uncontested.
> There's essentially zero margin for normal operating latency.
>
> On Wed, Apr 27, 2016 at 7:39 AM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
>
> > We have 6 production clusters and all of them are tuned differently, so
> > I'm not sure there is a setting I could easily give you. It really
> > depends on the usage. One of our devs wrote a blog post on G1GC
> > fundamentals recently. It's rather long, but could be worth a read:
> >
> > http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
> >
> > We will also have a blog post coming out in the next week or so that
> > talks specifically about tuning G1GC for HBase. I can update this thread
> > when that's available.
> >
> > On Tue, Apr 26, 2016 at 8:08 PM Saad Mufti <sa...@gmail.com> wrote:
> >
> > > That is interesting. Would it be possible for you to share what GC
> > > settings you ended up with that gave you the most predictable
> > > performance?
> > >
> > > Thanks.
> > >
> > > ----
> > > Saad
> > >
> > >
> > > On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
> > >
> > > > We were seeing this for a while with our CDH5 HBase clusters too. We
> > > > eventually correlated it very closely to GC pauses. Through heavy
> > > > tuning of our GC we were able to drastically reduce these log messages
> > > > by keeping most GC pauses under 100ms.
> > > >
> > > > On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti <sa...@gmail.com> wrote:
> > > >
> > > > > From what I can see in the source code, the default is actually
> > > > > even lower, at 100 ms (it can be overridden with
> > > > > hbase.regionserver.hlog.slowsync.ms).
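
A minimal sketch of that override as it would go in hbase-site.xml; the
property name is the one quoted above, the 400 ms value is purely
illustrative, and raising the threshold only changes when the INFO line is
logged, not how long the underlying sync actually takes:

<property>
  <!-- illustrative value; per this thread the shipped default is 100 ms -->
  <name>hbase.regionserver.hlog.slowsync.ms</name>
  <value>400</value>
</property>
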
> > > > >
> > > > > ----
> > > > > Saad
> > > > >
> > > > >
> > > > > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling <kevin.bowling@kev009.com> wrote:
> > > > >
> > > > > > I see similar log spam while the system has reasonable
> > > > > > performance.  Was the 250ms default chosen with SSDs and 10GbE in
> > > > > > mind or something?  I guess I'm surprised a sync write that passes
> > > > > > several times through JVMs to 2 remote datanodes would be expected
> > > > > > to consistently happen that fast.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti <saad.mufti@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > In our large HBase cluster based on CDH 5.5 in AWS, we're
> > > > > > > constantly seeing the following messages in the region server
> > > > > > > logs:
> > > > > > >
> > > > > > > 2016-04-25 14:02:55,178 INFO
> > > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > > > > > > 258 ms, current pipeline:
> > > > > > > [DatanodeInfoWithStorage[10.99.182.165:50010,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > > > > > > DatanodeInfoWithStorage[10.99.182.236:50010,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > > > > > > DatanodeInfoWithStorage[10.99.182.195:50010,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> > > > > > >
> > > > > > >
> > > > > > > These happen regularly while HBase appears to be operating
> > > > > > > normally with decent read and write performance. We do have
> > > > > > > occasional performance problems when regions are auto-splitting,
> > > > > > > and at first I thought this was related, but now I see it happens
> > > > > > > all the time.
> > > > > > >
> > > > > > >
> > > > > > > Can someone explain what this really means, and should we be
> > > > > > > concerned? I tracked down the source code that outputs it in
> > > > > > >
> > > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> > > > > > >
> > > > > > > but after going through the code I think I'd need to know much
> > > > > > > more about the code to glean anything from it or the associated
> > > > > > > JIRA ticket https://issues.apache.org/jira/browse/HBASE-11240.
> > > > > > >
> > > > > > > Also, what is this "pipeline" the ticket and code talk about?
> > > > > > >
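
A rough, hedged sketch of what the check producing that line looks like
conceptually (this is illustrative code, not the actual FSHLog.java logic):
the WAL sync is timed, the elapsed time is compared against
hbase.regionserver.hlog.slowsync.ms, and if it ran long the current HDFS
datanode pipeline (the chain of datanodes holding replicas of the WAL block
being written) is logged. The class, method names, and sample values below
are assumptions for illustration only.

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public class SlowSyncSketch {

  // Log a "Slow sync cost" style line if a WAL sync exceeded the threshold.
  static void logIfSlow(Configuration conf, long syncTookMs, String[] pipeline) {
    long thresholdMs = conf.getLong("hbase.regionserver.hlog.slowsync.ms", 100L);
    if (syncTookMs > thresholdMs) {
      // The "pipeline" is the chain of datanodes holding replicas of the WAL
      // block; a slow disk or a GC pause on any hop shows up as a slow sync.
      System.out.println("Slow sync cost: " + syncTookMs
          + " ms, current pipeline: " + Arrays.toString(pipeline));
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Configuration conf = new Configuration();  // would normally pick up hbase-site.xml
    long startNs = System.nanoTime();
    TimeUnit.MILLISECONDS.sleep(258);          // stand-in for a real WAL sync round trip
    long tookMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    logIfSlow(conf, tookMs, new String[] {
        "10.99.182.165:50010", "10.99.182.236:50010", "10.99.182.195:50010"});
  }
}
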
> > > > > > > Thanks in advance for any information and/or clarification
> > > > > > > anyone can provide.
> > > > > > >
> > > > > > > ----
> > > > > > >
> > > > > > > Saad
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>