Posted to users@kafka.apache.org by Lerh Chuan Low <le...@gmail.com> on 2021/08/05 03:57:27 UTC

Mass transaction abort on topic retention configuration change

Hey folks, I'm wondering if anyone has any insights into what I'm observing
in Kafka. We have a 6 node cluster with our topics consuming a lot of disk
space. All our apps use transactions. We recently ran kafka-topics.sh
--zookeeper --alter --topic --config to alter retention for a single topic,
from infinite (-1) to 1 year. This would result in approximately 3 TB of
space being freed up.

Approximately 1 minute later, all our apps which use the vanilla
producer/consumer crashed with TimeoutException, while all the Kafka
Streams apps using txns (EOS) hit TaskMigratedException with producer
fencing. Tracing through the logs, the timeline of events seems to be:
1. We ran the config change, and then Kafka received the config change from
ZK on its notification handler
2. Every one of the brokers started aborting/rolling back all transactions
approximately 1 minute later

Completed rollback of ongoing transaction for transactionalId....due to timeout

3. This caused the apps to die and the producers in Kafka Streams to get fenced
4. Disk segment deletion started happening about 1-2 minutes later
5. Our disk IOPS went through the roof, due to our Streams apps restoring state

One of our theories at the moment is that it may have something to do with our
use of HDDs, but I was wondering if anybody else knows what may have been
the cause. How does log deletion work? Does Kafka have to memory map those
before deleting or something? We observed Kafka logging the deletion of the
segments only after all the transactions had been aborted, though. Is there some
blocking operation on receiving the notification from Zookeeper or does
that necessarily cause Kafka to abort all active transactions? Should we
expect all txns to be aborted every time we run that script?

Any insights into the internals of how Kafka deletes segments or handles the
notification from ZK would be appreciated, thanks! (This is Kafka 2.8, but we
haven't tried turning on the config for ZooKeeper-less Kafka.)
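
For reference, the retention change described above can be sketched as a toy model. It assumes the setting involved is the topic-level retention.ms (where -1 means no time limit), that "1 year" means 365 days, and that a segment becomes eligible for time-based deletion once its newest record is older than the retention period. The Segment class and segments_to_delete function are hypothetical names for illustration only; this is a simplified sketch, not Kafka's actual code, and it says nothing about why the transactions were aborted.

```python
from dataclasses import dataclass

# retention.ms for "1 year", assuming a year of 365 days.
ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000


@dataclass
class Segment:
    """Hypothetical stand-in for a log segment; not a Kafka class."""
    base_offset: int
    largest_timestamp_ms: int  # timestamp of the newest record in the segment


def segments_to_delete(segments, now_ms, retention_ms):
    """Return the segments a time-based retention pass would drop.

    Simplified model: a negative retention_ms (such as -1) keeps everything;
    otherwise a segment is eligible once its newest record is older than
    retention_ms. The last (active) segment is always kept in this sketch.
    """
    if retention_ms < 0:
        return []
    inactive = segments[:-1]
    return [s for s in inactive
            if now_ms - s.largest_timestamp_ms > retention_ms]


if __name__ == "__main__":
    now = 2 * ONE_YEAR_MS
    log = [Segment(0, 0), Segment(100, now - 1000), Segment(200, now)]
    print(ONE_YEAR_MS)
    print(segments_to_delete(log, now, -1))
    print(segments_to_delete(log, now, ONE_YEAR_MS))
```

Under this model, switching from -1 to one year makes every inactive segment older than a year eligible in the same retention pass, which would be consistent with a large amount of space being freed at once.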