Posted to users@kafka.apache.org by Jack Foy <jf...@hiya.com> on 2018/07/18 18:24:25 UTC

Fwd: Recovering partition leadership outside ISR

Hi all,

We had an outage recently that I think we could have prevented, and I'd
like to get some feedback on the idea.

tl;dr:

When a partition leader writes an updated ISR, it should also record
its current log-end-offset. On leader election, if there are no live
replicas in the ISR, then a replica with this same log-end-offset
should be preferred before considering unclean leader election.
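
For illustration, the per-partition state the controller reads from ZooKeeper
today looks roughly like the JSON below; the leader_log_end_offset field is
hypothetical and only shows where the extra datum could live:

    {"controller_epoch": 42,
     "leader": 5,
     "version": 1,
     "leader_epoch": 12,
     "isr": [5],
     "leader_log_end_offset": 1234567}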

Details and use case:

We had a 5-node Kafka 1.0.0 cluster (since upgraded to 1.1.0) with unclean
leader election disabled. Well-configured topics have replication factor 3 and
min.insync.replicas=2, with producers setting acks=all.
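
For reference, that setup corresponds to something like the following, with
placeholder topic name, partition count, and ZooKeeper address (Kafka 1.x
tooling):

    # topic-level settings
    kafka-topics.sh --zookeeper zk1:2181 --create --topic events \
        --partitions 12 --replication-factor 3 \
        --config min.insync.replicas=2

    # producer settings (producer client config)
    acks=all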

At root, our cloud provider suffered a hardware failure, causing a partial
outage of network connectivity to disk storage. Broker 5's storage was on the
orphaned side of the network partition.

At the very start of the incident, broker 5 dropped all followers on brokers 1
and 4 out of the ISR for partitions it was leading. Its connections to brokers
2 and 3 and to Zookeeper stayed up, including to the controller on broker 3.
Broker 5 went offline entirely a few moments later, and stayed down with disk
state inaccessible for several hours.

We had configured multiple partitions with broker 5 as their leader and
followers on brokers 1 and 4. Before the incident those partitions had ISR
{5,1,4}, which shrank to {5} before broker 5 disappeared - leaving us with no
eligible replicas to become leader.

The only ways to bring these partitions back up were either to recover broker
5's up-to-date disk state or to enable unclean leader election. Had we lost
one follower, then the other, and then the leader, enabling unclean leader
election would have carried 50% risk of message loss.

In the end, we decided that the lowest-risk option was to enable unclean leader
election on the affected topics, force a controller election, watch the
partitions come back up, and disable unclean election. We added this
procedure to our runbooks.
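
In rough outline, the runbook steps look like the following; the topic name
and ZooKeeper address are placeholders, and deleting the ephemeral
/controller znode is what forces the controller election:

    # 1. Allow unclean election on the affected topic
    kafka-configs.sh --zookeeper zk1:2181 --entity-type topics \
        --entity-name events --alter \
        --add-config unclean.leader.election.enable=true

    # 2. Force a controller election
    zookeeper-shell.sh zk1:2181 delete /controller

    # 3. Once the partitions are back online, remove the override
    kafka-configs.sh --zookeeper zk1:2181 --entity-type topics \
        --entity-name events --alter \
        --delete-config unclean.leader.election.enable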

I think there's a safer recovery path that Kafka could support:

The leader should also record its current log-end-offset when it writes an
updated ISR. If the controller determines that it can't promote a replica from
the ISR, it should next look for a replica that has that same log-end-offset.
Only if that step also fails should it then consider unclean leader election.
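
As a Scala sketch, the preference order would look something like this (all
names here are invented for illustration, and this glosses over how the
controller would learn each live replica's log-end-offset; it is not the real
controller code):

    case class PartitionState(isr: Set[Int], leaderLogEndOffset: Long)

    def electLeader(state: PartitionState,
                    liveReplicas: Seq[Int],
                    logEndOffsets: Map[Int, Long],
                    uncleanElectionEnabled: Boolean): Option[Int] = {
      // 1. Today's rule: any live replica still in the ISR.
      liveReplicas.find(state.isr.contains)
        // 2. Proposed addition: a live replica whose log-end-offset matches
        //    the offset the leader recorded when it last wrote the ISR.
        .orElse(liveReplicas.find(r =>
          logEndOffsets.get(r).contains(state.leaderLogEndOffset)))
        // 3. Last resort, only if explicitly enabled: unclean election.
        .orElse(if (uncleanElectionEnabled) liveReplicas.headOption else None)
    }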

For our failure case, at least, I think this would have allowed a clean and
automatic recovery. Has this idea been considered before? Does it have
fatal flaws?

Thanks,

--
Jack Foy <jf...@hiya.com>