Posted to dev@kafka.apache.org by "Nick Lipple (JIRA)" <ji...@apache.org> on 2018/06/29 22:21:00 UTC
[jira] [Created] (KAFKA-7122) Data is lost when ZooKeeper times out
Nick Lipple created KAFKA-7122:
----------------------------------
Summary: Data is lost when ZooKeeper times out
Key: KAFKA-7122
URL: https://issues.apache.org/jira/browse/KAFKA-7122
Project: Kafka
Issue Type: Bug
Components: core, replication
Affects Versions: 0.11.0.2
Reporter: Nick Lipple
Noticed that a Kafka cluster will lose data when the leader for a partition has its ZooKeeper connection time out.
Sequence of events:
# Say broker A leads a partition followed by brokers B and C
# A ZK node has a network issue, and it happens to be the node used by broker A. Let's say this happens at offset X
# The Kafka controller immediately selects broker C as the new partition leader
# Broker A does not time out from ZooKeeper for another 4 seconds. Broker A still thinks it is the leader, presumably accepting producer writes.
# Broker A detects the ZK timeout and leaves the ISR.
# Broker A reconnects to ZK and rejoins the cluster as a follower for the partition
# Broker A truncates its log to some offset Y such that Y > X. Broker A proceeds to catch up normally and rejoins the ISR
# The ISR members for the partition are now in an inconsistent state:
## Broker C has all offsets X through Y plus everything after
## Broker B has all offsets X through Y plus everything after
## Broker A has offsets up to X and after Y. Everything between X and Y *IS MISSING*
# Within 5 minutes, the controller triggers a preferred replica election, making Broker A the new leader for the partition (this is default behavior)
After step 9, consumers will not receive any messages for offsets between X and Y.
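To make the sequence concrete, here is a toy model of steps 4 through 8. It is not Kafka code: a log is just a dict of offset to record, every name is hypothetical, and it assumes broker A really did append its own records from X onward while it was a stale leader (the "presumably" in step 4).

```python
# Toy model only: a log is a dict mapping offset -> record.
X, Y, END = 100, 140, 200

def rejoin_as_follower(follower_log, leader_log, truncate_to):
    """Keep follower records below truncate_to, then fetch the rest from the leader."""
    kept = {off: rec for off, rec in follower_log.items() if off < truncate_to}
    fetched = {off: rec for off, rec in leader_log.items() if off >= truncate_to}
    return {**kept, **fetched}

def missing_from(replica_log, leader_log):
    """Offsets where the replica does not hold the leader's record."""
    return [off for off in sorted(leader_log) if replica_log.get(off) != leader_log[off]]

# Records below X were replicated to every broker before the ZK issue.
shared = {off: f"old-{off}" for off in range(X)}
# Broker C became leader at X and holds the records the cluster accepted after that.
broker_c = {**shared, **{off: f"C-{off}" for off in range(X, END)}}
# Assumption: broker A kept appending as a stale leader from X onward.
broker_a = {**shared, **{off: f"A-{off}" for off in range(X, Y + 10)}}

# Step 7: A truncates to Y (not X) and catches up from the new leader.
broker_a = rejoin_as_follower(broker_a, broker_c, truncate_to=Y)

gap = missing_from(broker_a, broker_c)
print(gap[0], gap[-1], len(gap))  # 100 139 40
```

In this model broker A ends up with none of broker C's records in the range X to Y-1, which is the range consumers can no longer read once A is elected leader again.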
The root problem here seems to be that broker A truncates to offset Y when rejoining the cluster. It should truncate further back, to offset X, to prevent data loss.
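As a rough sketch of what "truncate back to X" would mean, the snippet below finds the first offset where a follower's log disagrees with the leader's and truncates there. This is only an illustration with hypothetical names: it compares records directly, which a real broker cannot do, and I am not claiming this is how Kafka picks its truncation point.

```python
def first_divergent_offset(follower_log, leader_log):
    """Smallest offset where the two logs disagree, or None if they are identical."""
    for off in sorted(set(follower_log) | set(leader_log)):
        if follower_log.get(off) != leader_log.get(off):
            return off
    return None

def truncate_and_catch_up(follower_log, leader_log, truncate_to):
    """Drop follower records at or after truncate_to, then copy the leader's from there."""
    kept = {off: rec for off, rec in follower_log.items() if off < truncate_to}
    fetched = {off: rec for off, rec in leader_log.items() if off >= truncate_to}
    return {**kept, **fetched}

x, log_end = 100, 200
shared = {off: f"old-{off}" for off in range(x)}
leader = {**shared, **{off: f"C-{off}" for off in range(x, log_end)}}
# Assumption: the stale leader appended its own records from x up to 149.
stale = {**shared, **{off: f"A-{off}" for off in range(x, 150)}}

point = first_divergent_offset(stale, leader)
repaired = truncate_and_catch_up(stale, leader, point)
print(point, repaired == leader)  # 100 True
```

Truncating at the divergence point (X in the scenario above) leaves the follower with exactly the leader's log, so a later preferred replica election would not expose a gap.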
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)