You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Péter Sinóros-Szabó <pe...@transferwise.com.INVALID> on 2021/02/09 10:48:02 UTC

Rebalancing stuck, never finishes

Hi,

I have an application running with 6 instances of it on Kubernetes. All 6
instances (pods) are the same, using the same consumer group id.
Recently we see that when the application is restarted (rolling restart on
K8s), the triggered rebalancing sometimes doesn't finish at all and the
Kafka Client stucks in rebalancing. Occasionally it finishes after 30-60
minutes, sometimes it doesn't.

If it is stuck, then if we stop the application and wait until
kafka-consumer-groups.sh doesn't show the group, and then we restart the
application, then the initial rebalancing finishes just fine and all is
good... until some hours or days later a rolling restart restarts it all
again.

I grabbed some logs from the time when it was continuously rebalancing.
Logs are mixed from 6 pods, but all pods have the same logs. (Kafka brokers
seem like running on localhost, but that's not true, traffic is routed on a
service mesh...)

2021-02-05T17:00:18.261422532Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
2147482641 rack: null) is unavailable or invalid
2021-02-05T17:00:18.261454952Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
org.apache.kafka.common.errors.DisconnectException: null

2021-02-05T17:00:18.499108799Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
thread #42 - KafkaConsumer[topicY]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:00:18.499300612Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
thread #42 - KafkaConsumer[topicY]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-43, groupId=fin] (Re-)joining group

No more logs from Kafka Consumer, it seems that the rebalancing doesn't
finish at all, I don't see logs in any of the pods about the partition
assignments being calculated, so my _guess_ is that the rebalancing stucks
in PreparingRebalance phase and never progress from there.

--- About 2 minutes 10 seconds later (sometimes I see a difference here of
1 minutes 10 seconds).

2021-02-05T17:02:29.615402388Z:  fin-df8d589bd-95bsz: INFO:
kafka-coordinator-heartbeat-thread | fin:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
2021-02-05T17:02:29.615520075Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
org.apache.kafka.common.errors.RebalanceInProgressException: The group is
rebalancing, so a rejoin is needed.

--- This last line may has a difference reason for rebalance too:
"Rebalance failed.: org.apache.kafka.common.errors.DisconnectException:
null"

2021-02-05T17:02:29.74932507Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
localhost:9204 (id: 2147482641 rack: null)
2021-02-05T17:02:29.749488204Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
thread #2 - KafkaConsumer[topicX]:
org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
clientId=consumer-fin-3, groupId=fin] (Re-)joining group

... and the same repeats forever.

Kafka Client version: 2.6.x
Kafka Broker version: 2.4.1


What can be the reason for this failing rebalance?

I found this bug on 2.4.1, is it possible that I hit this issue?
https://issues.apache.org/jira/browse/KAFKA-9752
"Consumer rebalance can be stuck after new member timeout with old
JoinGroup version"


Thanks for the help,
Peter

Re: Rebalancing stuck, never finishes

Posted by Sophie Blee-Goldman <so...@confluent.io>.

Peter,

It does seem like KAFKA-9752 is the most likely suspect, although if your
clients
were upgraded to 2.6.1 then I don't believe they would be on an early
enough version
of the JoinGroup to run into this. I'm not 100% sure though, it may be a
good idea
to leave a comment on that ticket and ping Jason directly since he
implemented the fix

Murilo,

I agree that your problem is not likely to be KAFKA-9752, since that was
caused by
KAFKA-9232 and that code is not present  in 2.2.1. But maybe you're hitting
up on the
issue which  KAFKA-9232 was originally intended to fix? In any case, 2.2.1
is quite old now
so there may be other known bugs which have since been fixed.

I know it's not always possible/easy, but I would still recommend to
upgrade your brokers to a
more recent version if you can.


On Fri, Feb 26, 2021 at 7:19 AM Murilo Tavares <mu...@gmail.com> wrote:

> Just to provide a bit more detail, I noticed Peter's pattern:
> "Rebalance failed. org.apache.kafka.common.errors.DisconnectException:
> null"
> "(Re-)joining group"
>
> But I also get a different pattern, interchangeably:
> Group coordinator broker-1:9092 (id: 2147483646 rack: null) is unavailable
> or invalid due to cause: null.isDisconnected: true. Rediscovery will be
> attempted.
> Followed by
> Discovered group coordinator broker-1:9092 (id: 2147483646 rack: null)
>
>
>
> On Fri, 26 Feb 2021 at 09:59, Murilo Tavares <mu...@gmail.com> wrote:
>
> > Hi
> > I got the same behaviour yesterday while trying to upgrade my
> KafkaStreams
> > app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.
> >
> > Looking at KAFKA-9752 it mentions the cause being two other tickets:
> > https://issues.apache.org/jira/browse/KAFKA-7610
> > https://issues.apache.org/jira/browse/KAFKA-9232
> >
> > Although the first ticket seems fixed in 2.2.0, the latter was just fixed
> > in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
> > But what I don't understand is that KAFKA-9752 says:
> > "Note that this is only possible if 1) we have a consumer using an old
> > JoinGroup version, 2) the consumer times out and disconnects from its
> > initial JoinGroup request."
> > In this case, I guess my consumer is not using an old JoinGroup, as my
> > consumers (KafkaStreams) are on 2.7.0...
> >
> > Thanks
> > Murilo
> >
> > On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
> > <pe...@transferwise.com.invalid> wrote:
> >
> >> Hey Sophie,
> >>
> >> thanks for the link, I was checking that ticket, but I was not sure if
> it
> >> is relevant for our case.
> >> Eventually we "fixed" our problem with reducing the session.timeout.ms
> >> (it
> >> was set to a high value for other reasons).
> >>
> >> But today, in another service, we faced the same problem when upgrading
> >> the
> >> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the
> brokers.
> >> Do you think the same problem (KAFKA-9752) might cause this problem too?
> >> It's hard to judge just based on the description of that ticket.
> >>
> >> Thanks,
> >> Peter
> >>
> >
>

Re: Rebalancing stuck, never finishes

Posted by Murilo Tavares <mu...@gmail.com>.

Just to provide a bit more detail, I noticed Peter's pattern:
"Rebalance failed. org.apache.kafka.common.errors.DisconnectException: null"
"(Re-)joining group"

But I also get a different pattern, interchangeably:
Group coordinator broker-1:9092 (id: 2147483646 rack: null) is unavailable
or invalid due to cause: null.isDisconnected: true. Rediscovery will be
attempted.
Followed by
Discovered group coordinator broker-1:9092 (id: 2147483646 rack: null)



On Fri, 26 Feb 2021 at 09:59, Murilo Tavares <mu...@gmail.com> wrote:

> Hi
> I got the same behaviour yesterday while trying to upgrade my KafkaStreams
> app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.
>
> Looking at KAFKA-9752 it mentions the cause being two other tickets:
> https://issues.apache.org/jira/browse/KAFKA-7610
> https://issues.apache.org/jira/browse/KAFKA-9232
>
> Although the first ticket seems fixed in 2.2.0, the latter was just fixed
> in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
> But what I don't understand is that KAFKA-9752 says:
> "Note that this is only possible if 1) we have a consumer using an old
> JoinGroup version, 2) the consumer times out and disconnects from its
> initial JoinGroup request."
> In this case, I guess my consumer is not using an old JoinGroup, as my
> consumers (KafkaStreams) are on 2.7.0...
>
> Thanks
> Murilo
>
> On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
> <pe...@transferwise.com.invalid> wrote:
>
>> Hey Sophie,
>>
>> thanks for the link, I was checking that ticket, but I was not sure if it
>> is relevant for our case.
>> Eventually we "fixed" our problem with reducing the session.timeout.ms
>> (it
>> was set to a high value for other reasons).
>>
>> But today, in another service, we faced the same problem when upgrading
>> the
>> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
>> Do you think the same problem (KAFKA-9752) might cause this problem too?
>> It's hard to judge just based on the description of that ticket.
>>
>> Thanks,
>> Peter
>>
>

Re: Rebalancing stuck, never finishes

Posted by Murilo Tavares <mu...@gmail.com>.

Hi
I got the same behaviour yesterday while trying to upgrade my KafkaStreams
app from 2.4.1 to 2.7.0. Our brokers are on 2.2.1.

Looking at KAFKA-9752 it mentions the cause being two other tickets:
https://issues.apache.org/jira/browse/KAFKA-7610
https://issues.apache.org/jira/browse/KAFKA-9232

Although the first ticket seems fixed in 2.2.0, the latter was just fixed
in 2.2.3, so my brokers shouldn't have the code for KAFKA-9232.
But what I don't understand is that KAFKA-9752 says:
"Note that this is only possible if 1) we have a consumer using an old
JoinGroup version, 2) the consumer times out and disconnects from its
initial JoinGroup request."
In this case, I guess my consumer is not using an old JoinGroup, as my
consumers (KafkaStreams) are on 2.7.0...

Thanks
Murilo

On Fri, 26 Feb 2021 at 06:06, Péter Sinóros-Szabó
<pe...@transferwise.com.invalid> wrote:

> Hey Sophie,
>
> thanks for the link, I was checking that ticket, but I was not sure if it
> is relevant for our case.
> Eventually we "fixed" our problem with reducing the session.timeout.ms (it
> was set to a high value for other reasons).
>
> But today, in another service, we faced the same problem when upgrading the
> Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
> Do you think the same problem (KAFKA-9752) might cause this problem too?
> It's hard to judge just based on the description of that ticket.
>
> Thanks,
> Peter
>

Re: Rebalancing stuck, never finishes

Posted by Péter Sinóros-Szabó <pe...@transferwise.com.INVALID>.

Hey Sophie,

thanks for the link, I was checking that ticket, but I was not sure if it
is relevant for our case.
Eventually we "fixed" our problem with reducing the session.timeout.ms (it
was set to a high value for other reasons).

But today, in another service, we faced the same problem when upgrading the
Kafka Client from 2.5.1 to 2.6.1. We are still using 2.4.1 on the brokers.
Do you think the same problem (KAFKA-9752) might cause this problem too?
It's hard to judge just based on the description of that ticket.

Thanks,
Peter

Re: Rebalancing stuck, never finishes

Posted by Sophie Blee-Goldman <so...@confluent.io>.

Hey Peter,

It does sound like you may have hit
https://issues.apache.org/jira/browse/KAFKA-9752

You will need to upgrade your brokers in order to get the fix, since it's a
broker-side issue

On Tue, Feb 9, 2021 at 2:48 AM Péter Sinóros-Szabó
<pe...@transferwise.com.invalid> wrote:

> Hi,
>
> I have an application running with 6 instances of it on Kubernetes. All 6
> instances (pods) are the same, using the same consumer group id.
> Recently we see that when the application is restarted (rolling restart on
> K8s), the triggered rebalancing sometimes doesn't finish at all and the
> Kafka Client stucks in rebalancing. Occasionally it finishes after 30-60
> minutes, sometimes it doesn't.
>
> If it is stuck, then if we stop the application and wait until
> kafka-consumer-groups.sh doesn't show the group, and then we restart the
> application, then the initial rebalancing finishes just fine and all is
> good... until some hours or days later a rolling restart restarts it all
> again.
>
> I grabbed some logs from the time when it was continuously rebalancing.
> Logs are mixed from 6 pods, but all pods have the same logs. (Kafka brokers
> seem like running on localhost, but that's not true, traffic is routed on a
> service mesh...)
>
> 2021-02-05T17:00:18.261422532Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid
> 2021-02-05T17:00:18.261454952Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2021-02-05T17:00:18.499108799Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:00:18.499300612Z:  fin-df8d589bd-85zf9: INFO: Camel (camel-1)
> thread #42 - KafkaConsumer[topicY]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-43, groupId=fin] (Re-)joining group
>
> No more logs from Kafka Consumer, it seems that the rebalancing doesn't
> finish at all, I don't see logs in any of the pods about the partition
> assignments being calculated, so my _guess_ is that the rebalancing stucks
> in PreparingRebalance phase and never progress from there.
>
> --- About 2 minutes 10 seconds later (sometimes I see a difference here of
> 1 minutes 10 seconds).
>
> 2021-02-05T17:02:29.615402388Z:  fin-df8d589bd-95bsz: INFO:
> kafka-coordinator-heartbeat-thread | fin:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-9, groupId=fin] Group coordinator localhost:9204 (id:
> 2147482641 rack: null) is unavailable or invalid, will attempt rediscovery
> 2021-02-05T17:02:29.615520075Z:  fin-df8d589bd-95bsz: INFO: Camel (camel-1)
> thread #28 - KafkaConsumer[twcard.plastic.events.finance.reconciliation]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-29, groupId=fin] Rebalance failed.:
> org.apache.kafka.common.errors.RebalanceInProgressException: The group is
> rebalancing, so a rejoin is needed.
>
> --- This last line may has a difference reason for rebalance too:
> "Rebalance failed.: org.apache.kafka.common.errors.DisconnectException:
> null"
>
> 2021-02-05T17:02:29.74932507Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] Discovered group coordinator
> localhost:9204 (id: 2147482641 rack: null)
> 2021-02-05T17:02:29.749488204Z:  fin-df8d589bd-j8mw6: INFO: Camel (camel-1)
> thread #2 - KafkaConsumer[topicX]:
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator: [Consumer
> clientId=consumer-fin-3, groupId=fin] (Re-)joining group
>
> ... and the same repeats forever.
>
> Kafka Client version: 2.6.x
> Kafka Broker version: 2.4.1
>
>
> What can be the reason for this failing rebalance?
>
> I found this bug on 2.4.1, is it possible that I hit this issue?
> https://issues.apache.org/jira/browse/KAFKA-9752
> "Consumer rebalance can be stuck after new member timeout with old
> JoinGroup version"
>
>
> Thanks for the help,
> Peter
>