Posted to dev@kafka.apache.org by Christopher Vollick <ch...@shopify.com.INVALID> on 2018/09/18 19:36:40 UTC

Semi Proposal: Unclean Leadership Elections and Out of Sync Brokers

Hey! I’m seeing behaviour I didn’t expect, and I’d like some expert opinions on it.

(TL;DR: min.insync.replicas=2 and acks=all aren’t enough to tolerate losing a leader transparently and safely, but maybe they could be)

Let’s say we have a topic with one partition, replication factor 3.
Leader 0, Followers 1,2
I have min.insync.replicas=2, and my producer has acks=all
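
For concreteness, here’s roughly what that setup looks like from the client side. This is just an illustrative sketch; the topic name "events" and the broker addresses are made up:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class SetupSketch {
    public static void main(String[] args) throws Exception {
        // Create the topic: 1 partition, replication factor 3, min.insync.replicas=2.
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker0:9092");
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic topic = new NewTopic("events", 1, (short) 3)
                .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }

        // Producer with acks=all: a write only succeeds once every ISR member has it,
        // and the leader rejects it if the ISR has shrunk below min.insync.replicas.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker0:9092,broker1:9092,broker2:9092");
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
    }
}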

Ok, so let’s say there’s an incident, and Broker 2 dies (or is unable to replicate).
That’s still fine, and the producer keeps going.

Then Broker 1 dies (or is unable to replicate).
Now the ISR list will just be Broker 0, and production will stop because of min.insync.replicas and acks=all.
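
Concretely, this is what I believe the producer sees at that point (continuing the hypothetical sketch above; NotEnoughReplicasException is the real exception the Java client maps the NOT_ENOUGH_REPLICAS error to, the rest is illustrative):

// With ISR = {0} and min.insync.replicas=2, the leader rejects the write.
// It's a retriable error, so the producer retries until it gives up and then
// surfaces the exception through the callback / future.
producer.send(new ProducerRecord<>("events", "key", "value"), (metadata, exception) -> {
    if (exception instanceof NotEnoughReplicasException) {
        // The write was never committed, which (as far as I can tell) is what
        // keeps Broker 0 from getting ahead of Brokers 1 and 2 here.
    }
});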

Now, Broker 0 dies.
The partition is now offline, which is expected.

Let’s say Broker 1 comes back before Broker 0 (or maybe Broker 0 is never coming back).
There’s no reason now why Broker 1 can’t assume leadership. Because of min.insync.replicas=2 and acks=all, nothing was produced after it fell out of the ISR, but it currently has no way to know that. When it comes back, it sees it’s not in the list of ISRs and assumes it may have missed something.

The only way to get out of this situation is to enable unclean leader election. In this case, if Broker 1 assumed leadership that would be fine, but if Broker 2 assumed leadership there would be message loss and consumer offset resets, since Broker 1 has messages Broker 2 doesn’t.
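
(For reference, the escape hatch I mean is the topic-level unclean.leader.election.enable config. Roughly this, continuing the hypothetical sketch from above:)

// Allow a non-ISR replica (Broker 1 or 2 here) to be elected leader, at the
// cost of losing whatever messages only the old leader had.
ConfigResource topicResource = new ConfigResource(ConfigResource.Type.TOPIC, "events");
Config unclean = new Config(Collections.singletonList(
    new ConfigEntry("unclean.leader.election.enable", "true")));
admin.alterConfigs(Collections.singletonMap(topicResource, unclean)).all().get();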

So that’s the setup.
So really, all that min.insync.replicas=2 gets me is reduced availability earlier on, plus consistency if Broker 0 comes back.
But if it doesn’t come back, I might lose consistency anyway.
Let me know if I’m wrong about something in there.

Proposal:
What we need to track is whether a broker is missing messages. We obviously don’t want to spam ZK each time we get a message, and we don’t want to change what ISR means, because so many things (including acks=all) rely on that.
So what if we added, for each partition, a list of “out of sync” replicas.
Basically, after a broker gets removed from the ISR list, the first time (and only the first time) a produce is accepted, it’s added to the OSR.
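
Something like this, purely as a sketch of the bookkeeping on the leader (all names are hypothetical, not real Kafka internals):

import java.util.HashSet;
import java.util.Set;

// Per-partition state the leader would maintain (sketch).
class PartitionState {
    Set<Integer> isr = new HashSet<>();               // existing in-sync replica set
    Set<Integer> osr = new HashSet<>();               // proposed "out of sync" replica set
    Set<Integer> assignedReplicas = new HashSet<>();

    // Called when the leader accepts a produce request.
    void onProduceAccepted() {
        for (int replica : assignedReplicas) {
            // The first (and only the first) produce accepted while a replica is
            // out of the ISR marks it as having missed data. That's one ZK write
            // per ISR shrink, not one per message.
            if (!isr.contains(replica) && !osr.contains(replica)) {
                osr.add(replica);
                // persistOsrToZk(osr);  // hypothetical
            }
        }
    }

    // A returning replica can lead cleanly iff it isn't marked as having missed data.
    boolean safeToLead(int brokerId) {
        return !osr.contains(brokerId);
    }
}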

That way, if Broker 1 comes back it can see that Broker 0 was the leader, but that before Broker 0 died it never accepted any messages Broker 1 doesn’t already have.
So Broker 1 is now safe to take leadership without unclean leader election enabled, and it can start syncing to Broker 2; once Broker 2 has caught up on what it missed, we’re back within min.insync.replicas and things become available again.

There are still some edges in here.
Like, maybe instead of having things be implicit (“I’m not in this list, so I must be OK”), we could have an explicit ISR plus an “UpToDateRs” list, and we only take replicas out of the latter when we accept a message while they’re not in the ISR.

Or, what if Broker 1 suffered corruption or data loss? It probably shouldn’t be able to just start up and take over, so maybe instead of a broker list there’s a kind of high-water mark we set to the current offset when we remove a replica from the ISR, and then clear when we accept a new message.
That way a replica can see on startup: “I’m in sync if I have THIS message, which I do, so I’m the leader now.”
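
A sketch of that variant, again with made-up names:

import java.util.HashMap;
import java.util.Map;

// Offset-mark variant (sketch): remember where the log ended when a replica
// left the ISR, and invalidate that mark as soon as anything new is accepted.
class PartitionMarks {
    Map<Integer, Long> inSyncMark = new HashMap<>();  // replicaId -> log end offset at removal

    void onRemovedFromIsr(int replicaId, long currentLogEndOffset) {
        inSyncMark.put(replicaId, currentLogEndOffset);
    }

    void onProduceAccepted() {
        // A new message means any replica outside the ISR has definitely missed data.
        inSyncMark.clear();
    }

    // On startup: "I'm in sync if I have THIS message, which I do, so I'm the leader now."
    // A replica that lost data would have a log end offset below its mark and be refused.
    boolean safeToLead(int brokerId, long localLogEndOffset) {
        Long mark = inSyncMark.get(brokerId);
        return mark != null && localLogEndOffset >= mark;
    }
}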

Basically I just wanted to know if this is ridiculous, or if I’m misunderstanding, or if I should make a KIP or what happens here.
Because right now it feels like setting min.insync.replicas=2 doesn’t actually give me much in a scenario like the one I outlined above.
With min.insync.replicas=1, producers would think they succeeded, and then the messages would be lost if Broker 0 went offline.
But if I need to enable unclean leader election anyway to recover, then those messages might be lost anyway, no?

Thanks for reading!