Posted to users@kafka.apache.org by Manish Khettry <ma...@ooyala.com> on 2013/02/06 02:41:02 UTC

Consumers constantly rebalancing

We are trying to troubleshoot a problem where our system cannot seem to
read messages from Kafka fast enough. We are on Kafka 0.6 and are using
the simple consumer.

Looking at the logs, we see a lot of chatty, almost constant messages
about rebalancing. For instance, every minute we see messages like this:


Consumer rookery-vacuum-prod_<first_ip>.internal-1360106018385
rebalancing the following partitions: List(0-0, 0-1, 0-10, 0-11, 0-12,
0-13, 0-14, 0-15, 0-16, 0-17, 0-18, 0-19, 0-2, 0-3, 0-4, 0-5, 0-6,
0-7, 0-8, 0-9, 1-0, 1-1, 1-10, 1-11, 1-12, 1-13, 1-14, 1-15, 1-16,
1-17, 1-18, 1-19, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 1-8, 1-9) for topic
compact-player-logs with consumers:


I also see zookeeper timeouts like so:

Unable to reconnect to ZooKeeper service, session 0x33c981ab95100ed
has expired, closing socket connection


We increased the ZooKeeper session timeout from 6 seconds to 12 seconds,
and this seems to have helped somewhat, but I'm not sure whether these
ZooKeeper timeouts at 6 seconds are symptomatic of a problem with our
ZooKeeper cluster and/or connectivity between the consumers and ZooKeeper.
Any thoughts?

Manish
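
For readers following along, the timeout change described above might look
roughly like the sketch below. The property names `zk.connect` and
`zk.sessiontimeout.ms` are an assumption based on the 0.7-era consumer
configuration and should be checked against the release in use; the host
name and the `ConsumerZkSettings` class are hypothetical.

```java
import java.util.Properties;

final class ConsumerZkSettings {
    // Assumed property names (0.7-era consumer config); verify for your release.
    static final String ZK_CONNECT = "zk.connect";
    static final String ZK_SESSION_TIMEOUT_MS = "zk.sessiontimeout.ms";

    /** Builds consumer properties carrying an explicit ZooKeeper session timeout. */
    static Properties withSessionTimeout(String zkConnect, int sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("session timeout must be positive");
        }
        Properties props = new Properties();
        props.setProperty(ZK_CONNECT, zkConnect);
        props.setProperty(ZK_SESSION_TIMEOUT_MS, Integer.toString(sessionTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical ZooKeeper address; 12000 ms matches the value in the post.
        Properties props = withSessionTimeout("zk1.example.internal:2181", 12000);
        props.list(System.out);
    }
}
```

Note that the ZooKeeper server negotiates the session timeout within its own
configured minimum and maximum, so a value requested by the client is only
honored if it falls inside that range.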

Re: Consumers constantly rebalancing

Posted by Manish Khettry <ma...@ooyala.com>.
Definitely no long pauses on the consumer. I see a minor collection every
second, each taking 0.1 or 0.2 seconds. That in itself seems a bit on the
high side (~10-20% of time spent in GC), but I don't think it would cause
a ZooKeeper session timeout. Getting GC stats on the ZooKeeper side is a
bit harder: this is not a system we control!

So in your opinion, are long GC pauses the most likely explanation for this?

m
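
The arithmetic behind the figures above can be sketched as follows. This is
only an illustration of the numbers quoted in the message (one 100-200 ms
minor collection per second, a 6-second session timeout); the `GcOverhead`
class and its method names are hypothetical.

```java
final class GcOverhead {
    /** Fraction of wall-clock time spent paused: one pause of pauseMs every intervalMs. */
    static double fraction(long pauseMs, long intervalMs) {
        if (pauseMs < 0 || intervalMs <= 0) {
            throw new IllegalArgumentException("pause must be >= 0 and interval > 0");
        }
        return (double) pauseMs / intervalMs;
    }

    /** How many times longer the session timeout is than a single pause. */
    static double timeoutMargin(long pauseMs, long sessionTimeoutMs) {
        if (pauseMs <= 0) {
            throw new IllegalArgumentException("pause must be positive");
        }
        return (double) sessionTimeoutMs / pauseMs;
    }

    public static void main(String[] args) {
        // 100-200 ms of minor GC per second of wall-clock time.
        System.out.printf("overhead: %.0f%% to %.0f%%%n",
                100 * fraction(100, 1000), 100 * fraction(200, 1000));
        // A single 200 ms pause against the 6 s session timeout.
        System.out.printf("timeout is %.0fx the longest minor pause%n",
                timeoutMargin(200, 6000));
    }
}
```

On these numbers a single minor collection is far shorter than the session
timeout, which is consistent with the view that minor collections alone
would not expire a session.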


On Tue, Feb 5, 2013 at 8:27 PM, Jay Kreps <ja...@gmail.com> wrote:

> The easiest way to diagnose is to enable GC logging on both the consumer
> and the zk instance and see if you have long pauses.
>
> -Jay

Re: Consumers constantly rebalancing

Posted by Jay Kreps <ja...@gmail.com>.
The easiest way to diagnose is to enable GC logging on both the consumer
and the zk instance and see if you have long pauses.

-Jay
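
As a rough sketch of that diagnosis: on HotSpot JVMs of this era, GC logging
can be enabled with flags such as `-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime` (newer JVMs use
`-Xlog:gc*` instead). The hypothetical `LongPauseScanner` below assumes the
"Total time for which application threads were stopped" lines that the last
flag produces, and reports pauses at or above a threshold such as the
session timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongPauseScanner {
    // Matches the stopped-time lines emitted with -XX:+PrintGCApplicationStoppedTime.
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: "
            + "([0-9]+(?:\\.[0-9]+)?) seconds");

    /** Returns every stopped-time value (in seconds) that is >= thresholdSeconds. */
    static List<Double> pausesAtLeast(Iterable<String> logLines, double thresholdSeconds) {
        List<Double> hits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double seconds = Double.parseDouble(m.group(1));
                if (seconds >= thresholdSeconds) {
                    hits.add(seconds);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration; real input would come from the GC log file.
        List<String> sample = Arrays.asList(
                "Total time for which application threads were stopped: 0.1500000 seconds",
                "Total time for which application threads were stopped: 7.2500000 seconds");
        System.out.println("pauses >= 6 s: " + pausesAtLeast(sample, 6.0));
    }
}
```

Running the same scan against both the consumer's and the ZooKeeper server's
logs would show which side, if either, stalls for longer than the session
timeout.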



Re: Consumers constantly rebalancing

Posted by Neha Narkhede <ne...@gmail.com>.
>> Unable to reconnect to ZooKeeper service, session 0x33c981ab95100ed
has expired, closing socket connection

This can happen either due to long GC pauses on your client side or due to
I/O pauses on the ZooKeeper server side.
That is why increasing the session timeout seems to have helped.
If this error happens frequently, it will cause your consumer instances to
keep rebalancing.

Thanks,
Neha
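
A deliberately simplified model of the explanation above: treat any stall
(client GC pause or server I/O pause) at least as long as the session timeout
as one session expiry, and each expiry as one rebalance trigger. Real expiry
depends on when the last heartbeat reached the server, so this is only an
approximation; the `SessionExpiryModel` class and the sample pause durations
are hypothetical.

```java
final class SessionExpiryModel {
    /** Counts stalls long enough to outlast the session timeout (simplified model). */
    static int expirations(long[] pausesMs, long sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("session timeout must be positive");
        }
        int count = 0;
        for (long pauseMs : pausesMs) {
            if (pauseMs >= sessionTimeoutMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up stall durations in milliseconds, for illustration only.
        long[] pausesMs = {500, 7000, 9000, 15000};
        System.out.println("expiries at 6 s timeout:  " + expirations(pausesMs, 6000));
        System.out.println("expiries at 12 s timeout: " + expirations(pausesMs, 12000));
    }
}
```

In this model a longer timeout absorbs the mid-length stalls but not the
longest ones, which would match a change that helps somewhat without
removing the rebalancing entirely.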

