Posted to users@kafka.apache.org by Clark Breyman <cl...@breyman.com> on 2014/04/14 21:41:15 UTC

GC pauses and rebalance failures

I've got some consumers under decent GC pressure and, as a result, they are
having ZK sessions expire and the consumers never recover. I see a number
of rebalance failures in the log after the ZK session expiration followed
by silence (and consumed partitions).

My hypothesis is that, since the GC pause is global to the JVM, I'll have
multiple ConsumerConnectors get expired at the same time and have
synchronized rebalance/backoff cycles. Since rebalance fails if new
consumers join mid balance, the multiple expired connectors will always
collide with each other while attempting to rebalance.

Is this hypothesis crazy? If not, is there a more likely situation? If the
hypothesis isn't crazy, how might I avoid this when the JVM is under GC
pressure?

Thanks in advance.

Re: GC pauses and rebalance failures

Posted by David DeMaagd <dd...@linkedin.com>.
Deliberate variation of the retry/backoff parameters on a per-client basis 
is probably an even more complicated work-around than bumping up the session 
timeout.  I've never tried it because it doesn't really address the probable 
root cause (GC causing client stalls, zookeeper server dropping connections 
because it is timing-sensitive, rebalances triggered by watches firing 
because of disconnections - it's a problem with zookeeper clients that I am very
familiar with). 
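
One rough way to check whether client stalls are long enough to matter is to
measure how far a short sleep overruns. The sketch below is only an
illustration: the class and method names are hypothetical, the comparison
against the session timeout is a simplification (a session can expire on
stalls shorter than the full timeout), and it detects any stall of the
process, not only GC.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Kafka or ZooKeeper.
final class StallProbe {
    private StallProbe() {}

    /** Milliseconds by which a sleep overran its intended duration (never negative). */
    static long overshootMs(long startNanos, long endNanos, long intendedSleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - intendedSleepMs);
    }

    /** Sleeps once and reports how long the thread was stalled beyond the sleep. */
    static long sampleOnce(long sleepMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMs);
        return overshootMs(start, System.nanoTime(), sleepMs);
    }

    /** Simplified check (assumption): a stall at least as long as the timeout is a risk. */
    static boolean risksSessionExpiry(long stallMs, long sessionTimeoutMs) {
        return stallMs >= sessionTimeoutMs;
    }
}
```

Calling sampleOnce(100) in a loop on a daemon thread and logging the result
whenever it gets close to the configured session timeout should show whether
the stalls line up with the session expirations in the consumer log.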

-- 
Dave DeMaagd | S'aite Reliability Engineering, Y'all
ddemaagd@linkedin.com | 818 262 7958

(clark@breyman.com - Mon, Apr 14, 2014 at 01:26:43PM -0700)
> Thanks David. One hypothesis we have is that using different
> rebalance.backoff.ms settings for the different ConsumerConnectors on the
> same JVM will keep them from synchronizing their rebalance attempts enough
> so that one can finish.

Re: GC pauses and rebalance failures

Posted by Clark Breyman <cl...@breyman.com>.
Thanks David. One hypothesis we have is that using different
rebalance.backoff.ms settings for the different ConsumerConnectors on the
same JVM will keep them from synchronizing their rebalance attempts enough
so that one can finish.
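
A minimal sketch of what I mean, assuming each connector is built from its own
Properties object. The class name, method name and the base/step values are
made up for illustration; only the rebalance.backoff.ms key is the real
consumer setting. Whether staggering actually lets one rebalance finish is
exactly the untested hypothesis above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: builds one consumer config per connector, each with a
// different rebalance.backoff.ms so that retries do not line up in time.
final class StaggeredBackoff {
    private StaggeredBackoff() {}

    static List<Properties> configsFor(Properties base, int connectorCount,
                                       int baseBackoffMs, int stepMs) {
        List<Properties> configs = new ArrayList<>();
        for (int i = 0; i < connectorCount; i++) {
            Properties props = new Properties();
            props.putAll(base); // copy the shared settings (group.id, zookeeper.connect, ...)
            // Connector i waits baseBackoffMs + i * stepMs between rebalance attempts.
            props.setProperty("rebalance.backoff.ms",
                    Integer.toString(baseBackoffMs + i * stepMs));
            configs.add(props);
        }
        return configs;
    }
}
```

Each resulting Properties would then go into its own ConsumerConfig and
Consumer.createJavaConsumerConnector call, in place of one shared config.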


On Mon, Apr 14, 2014 at 12:58 PM, David DeMaagd <dd...@linkedin.com> wrote:

> Correct - heavy client GC leads to numerous problems.  There are
> two things you can do:
>
> 1) Tune the client JVM better to get GC to a more reasonable level
> 2) Increase the zookeeper session timeout value (this is generally a
>    work-around for #1, but it can buy you time to dig into it)

Re: GC pauses and rebalance failures

Posted by David DeMaagd <dd...@linkedin.com>.
Correct - heavy client GC leads to numerous problems.  There are
two things you can do: 

1) Tune the client JVM better to get GC to a more reasonable level 
2) Increase the zookeeper session timeout value (this is generally a
   work-around for #1, but it can buy you time to dig into it)
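
For option 2, a rough sketch of raising the timeout on the consumer side. The
helper class and the 30000 ms figure in the usage note are illustrative
assumptions, not recommendations; zookeeper.session.timeout.ms is the consumer
property involved. Keep in mind that the zookeeper server bounds the
negotiated timeout with its own minSessionTimeout/maxSessionTimeout settings,
so a large client value may not take effect unless the server allows it.

```java
import java.util.Properties;

// Hypothetical helper: returns a copy of the consumer config with a larger
// zookeeper session timeout, leaving the original Properties untouched.
final class ZkTimeoutConfig {
    private ZkTimeoutConfig() {}

    static Properties withSessionTimeout(Properties base, int sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("sessionTimeoutMs must be positive");
        }
        Properties props = new Properties();
        props.putAll(base); // keep every other consumer setting as it was
        props.setProperty("zookeeper.session.timeout.ms",
                Integer.toString(sessionTimeoutMs));
        return props;
    }
}
```

For example, ZkTimeoutConfig.withSessionTimeout(props, 30000) would give the
client 30 seconds of slack, which only helps if the worst GC pauses are
shorter than that.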

-- 
Dave DeMaagd | S'aite Reliability Engineering, Y'all
ddemaagd@linkedin.com | 818 262 7958

(clark@breyman.com - Mon, Apr 14, 2014 at 12:41:15PM -0700)
> I've got some consumers under decent GC pressure and, as a result, they are
> having ZK sessions expire and the consumers never recover. I see a number
> of rebalance failures in the log after the ZK session expiration followed
> by silence (and consumed partitions).
> 
> My hypothesis is that, since the GC pause is global to the JVM, I'll have
> multiple ConsumerConnectors get expired at the same time and have
> synchronized rebalance/backoff cycles. Since rebalance fails if new
> consumers join mid balance, the multiple expired connectors will always
> collide with each other while attempting to rebalance.
> 
> Is this hypothesis crazy? If not, is there a more likely situation? If the
> hypothesis isn't crazy, how might I avoid this when the JVM is under GC
> pressure?
> 
> Thanks in advance.