Posted to users@kafka.apache.org by Avshalom Manevich <av...@gmail.com> on 2019/12/06 15:12:09 UTC

Kafka consumer group keeps moving to PreparingRebalance and stops consuming

We have a Kafka Streams consumer group that keeps moving to the
PreparingRebalance state and stops consuming. The pattern is as follows:

   1. The consumer group is running and stable for around 20 minutes.
   2. New consumers (members) start to appear in the group state without any
   clear reason. These new members only originate from a small number of VMs
   (not the same VMs each time), and they keep joining.
   3. The group state changes to PreparingRebalance.
   4. All consumers stop consuming, showing these logs: "Group coordinator
   ... is unavailable or invalid, will attempt rediscovery"
   5. The consumers on the VMs that generated the extra members show these logs:

Offset commit failed on partition X at offset Y: The coordinator is not
aware of this member.

Failed to commit stream task X since it got migrated to another thread
already. Closing it as zombie before triggering a new rebalance.

Detected task Z that got migrated to another thread. This implies that this
thread missed a rebalance and dropped out of the consumer group. Will try
to rejoin the consumer group.

   6. We kill all consumer processes on all VMs, the group moves to Empty
   with 0 members, we start the processes, and we're back to step 1.

Kafka version is 1.1.0, Kafka Streams version is 2.0.0.

We took thread dumps from the misbehaving consumers, and didn't see more
consumer threads than configured.

We tried restarting the Kafka brokers and cleaning the ZooKeeper cache.

We suspect that the issue has to do with missed heartbeats, but the
default heartbeat interval is 3 seconds and message handling times are
nowhere near that.
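
For reference, a minimal sketch of how the heartbeat/session settings we rely
on would look if set explicitly on the Streams config (the application id,
bootstrap server and values below are illustrative; we are simply running with
the defaults):

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class TimeoutConfigSketch {
        public static Properties streamsProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");    // illustrative
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // illustrative

            // Heartbeats are sent every heartbeat.interval.ms; the coordinator evicts a
            // member it hears nothing from for session.timeout.ms. The values below are
            // the consumer defaults (3s / 10s) that we believe we are running with.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 10000);
            return props;
        }
    }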

Has anyone encountered similar behaviour?

Re: Kafka consumer group keeps moving to PreparingRebalance and stops consuming

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Avshalom,

I think the first question to answer is where the new consumers are coming
from. From your description they seem to be unexpected (i.e. you did not
intentionally start up new instances), so looking at the VMs that
suddenly start new consumers would be my first step.
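
If it helps, here is a rough sketch of describing the group with the 2.0 admin
client while the churn is happening, to see exactly which hosts the unexpected
members connect from (the bootstrap server and group id are placeholders):

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ConsumerGroupDescription;
    import org.apache.kafka.clients.admin.MemberDescription;

    public class DescribeGroupSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                ConsumerGroupDescription group = admin
                        .describeConsumerGroups(Collections.singletonList("example-app")) // placeholder group id
                        .describedGroups().get("example-app").get();

                System.out.println("state = " + group.state());
                // Print each member together with the host it connected from, so the
                // VMs that spawn the unexpected members can be identified.
                for (MemberDescription member : group.members()) {
                    System.out.println(member.consumerId() + "\t" + member.clientId() + "\t" + member.host());
                }
            }
        }
    }

The kafka-consumer-groups.sh --describe command should surface the same
member/host information if the CLI is more convenient.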


Guozhang



-- 
-- Guozhang

Re: Kafka consumer group keeps moving to PreparingRebalance and stops consuming

Posted by Jamie <ja...@aol.co.uk.INVALID>.
Hi Avshalom, 
Have you tried increasing the session timeout? What's the current session timeout?
Regarding max.poll.interval.ms: this is the maximum time allowed between calls to the consumer's poll(). Are there any scenarios where processing one batch of records from the consumer (up to max.poll.records of them) could take longer than the configured max.poll.interval.ms? Maybe you could log when the records are returned to the Streams task and again when they have finished processing, to determine how long this normally takes.
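
A rough sketch of what I mean, assuming your per-record processing runs in
something like a foreach/peek step of the topology (the class name and
threshold here are made up for illustration):

    import java.time.Duration;
    import java.time.Instant;

    import org.apache.kafka.streams.kstream.ForeachAction;

    // Wrap whatever per-record handler you already have so that slow records get logged.
    public class TimedHandler<K, V> implements ForeachAction<K, V> {

        private final ForeachAction<K, V> delegate; // your existing processing logic
        private final long warnThresholdMs;         // arbitrary threshold for "slow"

        public TimedHandler(ForeachAction<K, V> delegate, long warnThresholdMs) {
            this.delegate = delegate;
            this.warnThresholdMs = warnThresholdMs;
        }

        @Override
        public void apply(K key, V value) {
            Instant start = Instant.now();
            delegate.apply(key, value);
            long elapsedMs = Duration.between(start, Instant.now()).toMillis();
            if (elapsedMs > warnThresholdMs) {
                System.out.println("Record " + key + " took " + elapsedMs + " ms to process");
            }
        }
    }

If the logged times (or the total for a batch of up to max.poll.records) ever
get anywhere near max.poll.interval.ms, that would point to slow processing as
the cause.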

Thanks, 
Jamie



Re: Kafka consumer group keeps moving to PreparingRebalance and stops consuming

Posted by Avshalom Manevich <av...@gmail.com>.
Hi Boyang,

Thanks for your reply.
We looked in this direction, but since we didn't change max.poll.interval.ms
from its default value, we're not sure that's the cause.




-- 
*Avshalom Manevich*

Re: Kafka consumer group keeps moving to PreparingRebalance and stops consuming

Posted by Boyang Chen <re...@gmail.com>.
Hey Avshalom,

The consumer instance is created per stream thread, so you will not be
creating new consumers; the root cause is definitely a member timeout.
Have you changed max.poll.interval.ms by any chance? That config controls
how long a gap between poll calls is tolerated, to make sure progress
is being made. If it's very tight, the consumer can stop heartbeating and
drop out of the group once processing is slow.
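
To illustrate the contract in plain-consumer terms (Streams drives the poll
loop for you, so this is only a sketch; the broker address, group id and topic
are placeholders, and 300000 is just the plain consumer default):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PollIntervalSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");        // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic")); // placeholder
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // If handling one batch takes longer than max.poll.interval.ms,
                        // the member is considered failed and the group rebalances.
                        process(record);
                    }
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            // placeholder for the actual work
        }
    }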

Best,
Boyang
