Posted to users@kafka.apache.org by Pieter Hameete <pi...@blockbax.com> on 2021/12/21 15:00:55 UTC

Kafka Streams threads sometimes fail transaction and are fenced after broker restart

Hi all,

After looking for an answer / some discussion on this matter on the community Slack and on Stack Overflow (https://stackoverflow.com/questions/70335773/kafka-streams-apps-threads-fail-transaction-and-are-fenced-and-restarted-after-k), this mailing list is my last hope :-)

We are noticing that our Streams apps' threads sometimes fail their transaction and get fenced after a broker restart. After the broker has started up again, the Streams apps log either an InvalidProducerEpochException ("Producer attempted to produce with an old epoch") or a ProducerFencedException ("There is a newer producer with the same transactionalId which fences the current one"). After these exceptions the thread dies and gets restarted, which causes rebalancing and a delay in processing for the partitions assigned to that thread.
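For reference, a minimal sketch of how a Streams app can have such a dying thread replaced instead of letting the whole client shut down, using the uncaught exception handler added in 2.8. The topic names and config values below are placeholders, not our actual application:

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

    public class ReplaceThreadSketch {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");   // placeholder topology

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-streams-app"); // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder

            KafkaStreams streams = new KafkaStreams(builder.build(), props);

            // When a stream thread dies, e.g. because its transactional producer was
            // fenced, ask Streams to start a replacement thread instead of shutting
            // the whole client down.
            streams.setUncaughtExceptionHandler(exception ->
                    StreamThreadExceptionResponse.REPLACE_THREAD);

            streams.start();
        }
    }

Replacing the thread keeps the app running, but the replacement still triggers a rebalance, which is exactly the delay we would like to avoid.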

Some more details on our setup:

  1.  We use Kafka 2.8 (Confluent Platform 6.2) for Brokers and 2.8.1 for streams apps.
  2.  To ensure smooth broker restarts we use controlled shutdown for our brokers, and restart them 1-by-1 while waiting for all partitions to be in-sync before restarting.
  3.  We use three brokers, with min in-sync replicas set to 2. As far as I know this should allow restarting a single broker without affecting clients, since 2 replicas remain in sync.
  4.  The streams apps are configured with a group instance id and a session timeout that allows for smooth restarts of the streams apps (the client-side settings are roughly sketched below).
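For concreteness, the client-side part of this setup corresponds roughly to the sketch below. The application id, bootstrap servers, instance id and timeout value are placeholders rather than our actual configuration, and the broker-side settings are shown only as comments:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.Properties;

    public class StreamsSettingsSketch {
        // Rough sketch of the Streams/client settings described above.
        public static Properties streamsProperties() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-streams-app");                          // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092,broker-3:9092"); // placeholder
            // Transactions imply exactly-once processing; in 2.8 this is
            // "exactly_once" (or "exactly_once_beta").
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            // Static membership plus a generous session timeout, so that rolling
            // restarts of the streams apps themselves do not trigger rebalances.
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "app-instance-1"); // placeholder
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 120000);          // placeholder
            return props;
        }
    }
    // Broker side (server.properties), as described above:
    //   min.insync.replicas=2
    //   controlled.shutdown.enable=true
    // and the brokers are restarted one by one, waiting for all partitions
    // to be back in sync before moving on to the next broker.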

In the logs we notice that during broker shutdown the clients log NOT_LEADER_OR_FOLLOWER exceptions (this is to be expected when partition leadership is being migrated). Then we see heartbeats failing (expected, because the broker is shutting down and group leadership is migrated). Then we see the clients discover a new group coordinator (expected, but it bounces a bit between the old and new leader, which I didn't expect). Finally the app stabilizes with a new group coordinator.

Then, after the broker starts up again, we see the clients log FETCH_SESSION_ID_NOT_FOUND exceptions for the starting broker. The starting broker is rediscovered as a transaction coordinator. Shortly after that, the InvalidProducerEpochExceptions and ProducerFencedExceptions occur for some Streams app threads, causing those threads to be fenced and restarted.

What could be the reason for this happening? My first guess would be that the starting broker is taking over as transaction coordinator before it has synced its transaction state with the in-sync brokers. This difference in transaction state could be a reason the starting broker disagrees on the current producer epoch and/or transactional ids.
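If it is relevant to that guess: the transaction state lives in the internal __transaction_state topic, so I assume its replication settings determine how much state a newly elected transaction coordinator has to catch up on. As far as I know the broker-side defaults are the following (ours may differ):

    # server.properties (defaults, as far as I know)
    transaction.state.log.replication.factor=3
    transaction.state.log.min.isr=2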

Does anyone with more knowledge on this topic have an idea what could be causing these exceptions, or how we could get more information on what's going on here?
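One thing we could try ourselves is turning up client-side logging around the transaction manager and the consumer coordinator, roughly as below. These logger names are internal client classes, so treat them as an assumption that may change between versions:

    # log4j configuration of the streams app (sketch)
    log4j.logger.org.apache.kafka.clients.producer.internals.TransactionManager=DEBUG
    log4j.logger.org.apache.kafka.clients.consumer.internals.AbstractCoordinator=DEBUG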

Best regards and thank you in advance!

Pieter Hameete

Re: Kafka Streams threads sometimes fail transaction and are fenced after broker restart

Posted by Pieter Hameete <pi...@blockbax.com>.
Hi Guozhang,

Thank you so much for replying and bringing this KIP to my attention. It certainly looks like this could be the issue we are encountering.

I'll monitor the KIP to see if there's anything we can do to help with testing if that is needed.

Best wishes and have some great holidays!

-- Pieter

________________________________
From: Guozhang Wang <wa...@gmail.com>
Sent: Tuesday, 21 December 2021 19:50
To: Users <us...@kafka.apache.org>
Subject: Re: Kafka Streams threads sometimes fail transaction and are fenced after broker restart


Re: Kafka Streams threads sometimes fail transaction and are fenced after broker restart

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Pieter,

Thanks for bringing this to the community's attention. After reading your
description I suspect you're hitting this issue:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts

Basically, today we do not try to distinguish between the two cases (one fatal,
one recoverable) and just conservatively treat both of them as fatal, causing
e.g. the Streams embedded producers to be shut down and restarted, and hence
causing rebalances.
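
To make that concrete with the plain producer API (a rough sketch only, not
Streams internals): with today's client, ProducerFencedException and a few
other exceptions are documented as fatal, so the producer has to be closed and
recreated, while other KafkaExceptions can be handled by aborting the
transaction and retrying. The bootstrap server, transactional id and topic
below are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.AuthorizationException;
    import org.apache.kafka.common.errors.OutOfOrderSequenceException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalSendSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id");  // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("output-topic", "key", "value"));
                producer.commitTransaction();
            } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
                // Treated as fatal: close this producer and create a new one. Inside
                // Streams this is what surfaces as a fenced, dying stream thread.
                producer.close();
            } catch (KafkaException e) {
                // Treated as recoverable: abort and retry with the same producer.
                producer.abortTransaction();
            }
        }
    }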


Guozhang


-- 
-- Guozhang