Posted to dev@kafka.apache.org by Anna Povzner <an...@confluent.io> on 2018/04/05 19:17:15 UTC

[DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Hi,


I just created KIP-279 to fix edge cases of log divergence for both clean
and unclean leader election configs.


https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over


The KIP is basically a follow-up to KIP-101, and proposes a slight
extension to the replication protocol to fix edge cases where logs can
diverge due to fast leader fail over.


Feedback and suggestions are welcome!


Thanks,

Anna

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Anna Povzner <an...@confluent.io>.
Thanks everyone for the feedback. I will start a voting thread tomorrow
morning if there are no more comments.

Regards,
Anna


On Wed, Apr 11, 2018 at 3:14 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Anna,
>
> Thanks for the KIP. Looks good to me.
>
> Great point on bounding the cleaning point in a compacted topic by high
> watermark. Filed https://issues.apache.org/jira/browse/KAFKA-6780 to track
> it.
>
> Jun
>
>
> On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:
>
> > Hi,
> >
> >
> > I just created KIP-279 to fix edge cases of log divergence for both clean
> > and unclean leader election configs.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 279%3A+Fix+log+divergence+between+leader+and+follower+
> > after+fast+leader+fail+over
> >
> >
> > The KIP is basically a follow up to KIP-101, and proposes a slight
> > extension to the replication protocol to fix edge cases where logs can
> > diverge due to fast leader fail over.
> >
> >
> > Feedback and suggestions are welcome!
> >
> >
> > Thanks,
> >
> > Anna
> >
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Jun Rao <ju...@confluent.io>.
Hi, Anna,

Thanks for the KIP. Looks good to me.

Great point on bounding the cleaning point in a compacted topic by high
watermark. Filed https://issues.apache.org/jira/browse/KAFKA-6780 to track
it.
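
For the record, a minimal sketch of the idea only (hypothetical names, not
the actual LogCleaner code), where the cleaner's upper bound is capped by the
high watermark so uncommitted records cannot be compacted away before
replicas reconcile after a leader change:

final class CleaningBoundSketch {
    // Clean only up to whichever comes first: the high watermark or the start
    // of the active segment (illustrative inputs; the real cleaner has more
    // constraints, e.g. the first unstable offset for transactions).
    static long firstUncleanableOffset(long highWatermark, long activeSegmentBaseOffset) {
        return Math.min(highWatermark, activeSegmentBaseOffset);
    }

    public static void main(String[] args) {
        System.out.println(firstUncleanableOffset(120L, 200L)); // prints 120
    }
}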

Jun


On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:

> Hi,
>
>
> I just created KIP-279 to fix edge cases of log divergence for both clean
> and unclean leader election configs.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 279%3A+Fix+log+divergence+between+leader+and+follower+
> after+fast+leader+fail+over
>
>
> The KIP is basically a follow up to KIP-101, and proposes a slight
> extension to the replication protocol to fix edge cases where logs can
> diverge due to fast leader fail over.
>
>
> Feedback and suggestions are welcome!
>
>
> Thanks,
>
> Anna
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Anna Povzner <an...@confluent.io>.
Ted and Jason, I see now how the description of unclean leader election
made the proposed approach sound more complicated than it is (as if there
were more roundtrips). I wrote it that way to show correctness: in theory,
we could compare the "complete epoch lineage", but in practice we compare
only one or two recent leader epochs.

So, Jason's statements are correct. The common case for unclean leader
election is still one roundtrip, including the rare case reported in
KAFKA-6361 (two fast consecutive leader failovers).

I updated the description of handling unclean leader elections to address
this.
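
For concreteness, here is a rough, self-contained sketch of that
back-and-forth (made-up names and a toy epoch cache, not the actual fetcher
code); it plays out a KAFKA-6361-style fast failover and converges after a
single roundtrip:

import java.util.NavigableMap;
import java.util.TreeMap;

public class EpochNegotiationSketch {
    // Toy replica: a map of leaderEpoch -> startOffset plus a log end offset.
    static class Replica {
        final NavigableMap<Integer, Long> epochStartOffsets = new TreeMap<>();
        final long logEndOffset;

        Replica(long logEndOffset) { this.logEndOffset = logEndOffset; }

        void addEpoch(int epoch, long startOffset) { epochStartOffsets.put(epoch, startOffset); }

        int latestEpoch() { return epochStartOffsets.lastKey(); }

        boolean hasEpoch(int epoch) { return epochStartOffsets.containsKey(epoch); }

        int largestEpochAtMost(int epoch) { return epochStartOffsets.floorKey(epoch); }

        // End offset of an epoch = start offset of the next epoch, or the log
        // end offset if it is the latest epoch.
        long endOffsetFor(int epoch) {
            Integer next = epochStartOffsets.higherKey(epoch);
            return next == null ? logEndOffset : epochStartOffsets.get(next);
        }
    }

    public static void main(String[] args) {
        // Fast leader failover: the follower wrote epoch 2 locally, but the new
        // leader moved straight from epoch 1 to epoch 3 and never saw epoch 2.
        Replica leader = new Replica(100L);
        leader.addEpoch(1, 0L);
        leader.addEpoch(3, 80L);

        Replica follower = new Replica(95L);
        follower.addEpoch(1, 0L);
        follower.addEpoch(2, 80L);

        int roundTrips = 0;
        int proposedEpoch = follower.latestEpoch();
        while (true) {
            // One OffsetsForLeaderEpoch roundtrip: the leader answers with the
            // largest epoch it knows that is <= the proposed epoch, plus its
            // end offset for that epoch.
            roundTrips++;
            int leaderEpoch = leader.largestEpochAtMost(proposedEpoch);
            long leaderEndOffset = leader.endOffsetFor(leaderEpoch);
            if (follower.hasEpoch(leaderEpoch)) {
                long truncateTo = Math.min(leaderEndOffset, follower.endOffsetFor(leaderEpoch));
                System.out.println("truncate to " + truncateTo + " after " + roundTrips + " roundtrip(s)");
                return;
            }
            // Otherwise propose the next-older epoch the follower knows about.
            proposedEpoch = follower.largestEpochAtMost(leaderEpoch);
        }
    }
}

Only several consecutive fast failovers would make the loop run additional
iterations.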



On Mon, Apr 9, 2018 at 10:01 AM, Jason Gustafson <ja...@confluent.io> wrote:

> Hey Anna,
>
> Thanks for picking this up! I think the solution looks good to me. Just
> wanted to check my understanding on one part. When describing the handling
> of unclean leader elections, you mention comparing the "complete epoch
> lineage" from both brokers in order to converge on the log. I think this
> makes it sound a bit scarier than it actually is. In practice, it seems
> like we'd only have multiple round trips if we hit a bunch of these already
> rare "fast leader failover" cases consecutively. As far as I know, the case
> I reported is the only instance we've seen in the wild, and with this
> solution, we'd only need one round trip to handle it. So while it may be
> theoretically possible to need multiple round trips for convergence, far
> and away the common case would only require a very small number (usually
> exactly one).
>
> Is that correct?
>
> Thanks,
> Jason
>
> On Fri, Apr 6, 2018 at 5:47 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Makes sense.
> > Thanks for the explanation.
> > -------- Original message --------From: Anna Povzner <an...@confluent.io>
> > Date: 4/6/18  5:38 PM  (GMT-08:00) To: dev@kafka.apache.org Subject: Re:
> > [DISCUSS] KIP-279: Fix log divergence between leader and follower after
> > fast leader fail over
> > Hi Ted,
> >
> > I updated the Rejected Alternatives section with a more thorough
> > description of alternatives and reasoning for choosing the solution we
> > proposed.
> >
> > While it is more clear why the second alternative guarantees one
> roundtrip
> > for the clean leader election case, the proposed solution also guarantees
> > it. This is based on the fact that we cannot have more than one
> > back-to-back leader change due to preferred leader election where the
> > leader is not pushed out of the ISR, which means the follower will have
> at
> > most one leader epoch unknown to the new leader, and so the leader will
> be
> > able to respond with the epoch that the follower knows about in the first
> > response.
> >
> > For unclean leader election case, the second alternative reduces the
> number
> > of roundtrips but for rare cases: we need at least 3 fast leader changes
> to
> > see the advantage. Approximate calculation: Proposed solution requires
> > (N+1)/2 roundtrips for N fast leader changes (worst-case, could be less
> > roundtrips for the same number of leader change); Alternative solution
> > requires at most 2 roundtrips (except super rare cases, where we may want
> > to limit the size of OffsetForLeaderEpoch request). This comes at the
> cost
> > of a bigger change in the OffsetForLeaderEpoch request,
> > larger OffsetForLeaderEpoch request size on average, and additional
> > complexity of dealing with how long the sequence should be for the
> > subsequent OffsetForLeaderEpoch requests, handling the edge/contrived
> cases
> > where sequence may become too long.
> >
> > So, I think, the main trade-off here is improving efficiency of a broker
> > becoming a follower in rare cases of unclean leader election/at least 3
> > fast leader changes vs. less complexity in the common case. The proposed
> > solution in the KIP is for less complexity.
> >
> > Please let me know if you have any concerns or suggestions.
> >
> > Thanks,
> > Anna
> >
> > On Thu, Apr 5, 2018 at 1:33 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > For the second alternative which was rejected (The follower sends all
> > > sequences of {leader_epoch, end_offset})
> > >
> > > bq. also increases the size of OffsetForLeaderEpoch request by at least
> > > 64bit
> > >
> > > Though the size increases, the number of roundtrips is reduced
> > meaningfully
> > > which would increase the robustness of the solution.
> > >
> > > Please expand the reasoning for unclean leader election for this
> > > alternative.
> > >
> > > Thanks
> > >
> > > On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io>
> wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > > I just created KIP-279 to fix edge cases of log divergence for both
> > clean
> > > > and unclean leader election configs.
> > > >
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 279%3A+Fix+log+divergence+between+leader+and+follower+
> > > > after+fast+leader+fail+over
> > > >
> > > >
> > > > The KIP is basically a follow up to KIP-101, and proposes a slight
> > > > extension to the replication protocol to fix edge cases where logs
> can
> > > > diverge due to fast leader fail over.
> > > >
> > > >
> > > > Feedback and suggestions are welcome!
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Anna
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Anna,

Thanks for picking this up! I think the solution looks good to me. Just
wanted to check my understanding on one part. When describing the handling
of unclean leader elections, you mention comparing the "complete epoch
lineage" from both brokers in order to converge on the log. I think this
makes it sound a bit scarier than it actually is. In practice, it seems
like we'd only have multiple round trips if we hit a bunch of these already
rare "fast leader failover" cases consecutively. As far as I know, the case
I reported is the only instance we've seen in the wild, and with this
solution, we'd only need one round trip to handle it. So while it may be
theoretically possible to need multiple round trips for convergence, far
and away the common case would only require a very small number (usually
exactly one).

Is that correct?

Thanks,
Jason

On Fri, Apr 6, 2018 at 5:47 PM, Ted Yu <yu...@gmail.com> wrote:

> Makes sense.
> Thanks for the explanation.
> -------- Original message --------From: Anna Povzner <an...@confluent.io>
> Date: 4/6/18  5:38 PM  (GMT-08:00) To: dev@kafka.apache.org Subject: Re:
> [DISCUSS] KIP-279: Fix log divergence between leader and follower after
> fast leader fail over
> Hi Ted,
>
> I updated the Rejected Alternatives section with a more thorough
> description of alternatives and reasoning for choosing the solution we
> proposed.
>
> While it is more clear why the second alternative guarantees one roundtrip
> for the clean leader election case, the proposed solution also guarantees
> it. This is based on the fact that we cannot have more than one
> back-to-back leader change due to preferred leader election where the
> leader is not pushed out of the ISR, which means the follower will have at
> most one leader epoch unknown to the new leader, and so the leader will be
> able to respond with the epoch that the follower knows about in the first
> response.
>
> For unclean leader election case, the second alternative reduces the number
> of roundtrips but for rare cases: we need at least 3 fast leader changes to
> see the advantage. Approximate calculation: Proposed solution requires
> (N+1)/2 roundtrips for N fast leader changes (worst-case, could be less
> roundtrips for the same number of leader change); Alternative solution
> requires at most 2 roundtrips (except super rare cases, where we may want
> to limit the size of OffsetForLeaderEpoch request). This comes at the cost
> of a bigger change in the OffsetForLeaderEpoch request,
> larger OffsetForLeaderEpoch request size on average, and additional
> complexity of dealing with how long the sequence should be for the
> subsequent OffsetForLeaderEpoch requests, handling the edge/contrived cases
> where sequence may become too long.
>
> So, I think, the main trade-off here is improving efficiency of a broker
> becoming a follower in rare cases of unclean leader election/at least 3
> fast leader changes vs. less complexity in the common case. The proposed
> solution in the KIP is for less complexity.
>
> Please let me know if you have any concerns or suggestions.
>
> Thanks,
> Anna
>
> On Thu, Apr 5, 2018 at 1:33 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > For the second alternative which was rejected (The follower sends all
> > sequences of {leader_epoch, end_offset})
> >
> > bq. also increases the size of OffsetForLeaderEpoch request by at least
> > 64bit
> >
> > Though the size increases, the number of roundtrips is reduced
> meaningfully
> > which would increase the robustness of the solution.
> >
> > Please expand the reasoning for unclean leader election for this
> > alternative.
> >
> > Thanks
> >
> > On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:
> >
> > > Hi,
> > >
> > >
> > > I just created KIP-279 to fix edge cases of log divergence for both
> clean
> > > and unclean leader election configs.
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 279%3A+Fix+log+divergence+between+leader+and+follower+
> > > after+fast+leader+fail+over
> > >
> > >
> > > The KIP is basically a follow up to KIP-101, and proposes a slight
> > > extension to the replication protocol to fix edge cases where logs can
> > > diverge due to fast leader fail over.
> > >
> > >
> > > Feedback and suggestions are welcome!
> > >
> > >
> > > Thanks,
> > >
> > > Anna
> > >
> >
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Ted Yu <yu...@gmail.com>.
Makes sense.
Thanks for the explanation. 
-------- Original message --------
From: Anna Povzner <an...@confluent.io>
Date: 4/6/18 5:38 PM (GMT-08:00)
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over
Hi Ted,

I updated the Rejected Alternatives section with a more thorough
description of alternatives and reasoning for choosing the solution we
proposed.

While it is more clear why the second alternative guarantees one roundtrip
for the clean leader election case, the proposed solution also guarantees
it. This is based on the fact that we cannot have more than one
back-to-back leader change due to preferred leader election where the
leader is not pushed out of the ISR, which means the follower will have at
most one leader epoch unknown to the new leader, and so the leader will be
able to respond with the epoch that the follower knows about in the first
response.

For unclean leader election case, the second alternative reduces the number
of roundtrips but for rare cases: we need at least 3 fast leader changes to
see the advantage. Approximate calculation: Proposed solution requires
(N+1)/2 roundtrips for N fast leader changes (worst-case, could be less
roundtrips for the same number of leader change); Alternative solution
requires at most 2 roundtrips (except super rare cases, where we may want
to limit the size of OffsetForLeaderEpoch request). This comes at the cost
of a bigger change in the OffsetForLeaderEpoch request,
larger OffsetForLeaderEpoch request size on average, and additional
complexity of dealing with how long the sequence should be for the
subsequent OffsetForLeaderEpoch requests, handling the edge/contrived cases
where sequence may become too long.

So, I think, the main trade-off here is improving efficiency of a broker
becoming a follower in rare cases of unclean leader election/at least 3
fast leader changes vs. less complexity in the common case. The proposed
solution in the KIP is for less complexity.

Please let me know if you have any concerns or suggestions.

Thanks,
Anna

On Thu, Apr 5, 2018 at 1:33 PM, Ted Yu <yu...@gmail.com> wrote:

> For the second alternative which was rejected (The follower sends all
> sequences of {leader_epoch, end_offset})
>
> bq. also increases the size of OffsetForLeaderEpoch request by at least
> 64bit
>
> Though the size increases, the number of roundtrips is reduced meaningfully
> which would increase the robustness of the solution.
>
> Please expand the reasoning for unclean leader election for this
> alternative.
>
> Thanks
>
> On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:
>
> > Hi,
> >
> >
> > I just created KIP-279 to fix edge cases of log divergence for both clean
> > and unclean leader election configs.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 279%3A+Fix+log+divergence+between+leader+and+follower+
> > after+fast+leader+fail+over
> >
> >
> > The KIP is basically a follow up to KIP-101, and proposes a slight
> > extension to the replication protocol to fix edge cases where logs can
> > diverge due to fast leader fail over.
> >
> >
> > Feedback and suggestions are welcome!
> >
> >
> > Thanks,
> >
> > Anna
> >
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Anna Povzner <an...@confluent.io>.
Hi Ted,

I updated the Rejected Alternatives section with a more thorough
description of alternatives and reasoning for choosing the solution we
proposed.

While it is clearer why the second alternative guarantees one roundtrip
for the clean leader election case, the proposed solution also guarantees
it. This follows from the fact that we cannot have more than one
back-to-back leader change due to preferred leader election where the
leader is not pushed out of the ISR; as a result, the follower will have at
most one leader epoch unknown to the new leader, and the leader will be
able to respond with an epoch the follower knows about in the first
response.

For the unclean leader election case, the second alternative reduces the
number of roundtrips, but only in rare cases: we need at least 3 fast leader
changes to see the advantage. Approximate calculation: the proposed solution
requires (N+1)/2 roundtrips for N fast leader changes (worst case; it could
be fewer roundtrips for the same number of leader changes); the alternative
solution requires at most 2 roundtrips (except in super rare cases, where we
may want to limit the size of the OffsetForLeaderEpoch request). This comes
at the cost of a bigger change to the OffsetForLeaderEpoch request, a larger
OffsetForLeaderEpoch request size on average, and the additional complexity
of deciding how long the sequence should be for subsequent
OffsetForLeaderEpoch requests and handling the edge/contrived cases where
the sequence may become too long.
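
Just to restate those worst-case bounds as numbers (nothing here beyond the
formulas above):

public class RoundTripBounds {
    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            // Proposed solution: about (N+1)/2 roundtrips in the worst case
            // (often fewer); alternative: at most 2, barring super rare cases.
            int proposedWorstCase = (n + 1) / 2;
            int alternativeWorstCase = 2;
            System.out.printf("N=%d fast leader changes: proposed <= %d, alternative <= %d%n",
                    n, proposedWorstCase, alternativeWorstCase);
        }
    }
}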

So, I think the main trade-off here is improving the efficiency of a broker
becoming a follower in the rare case of an unclean leader election with at
least 3 fast leader changes vs. less complexity in the common case. The
proposed solution in the KIP opts for less complexity.

Please let me know if you have any concerns or suggestions.

Thanks,
Anna

On Thu, Apr 5, 2018 at 1:33 PM, Ted Yu <yu...@gmail.com> wrote:

> For the second alternative which was rejected (The follower sends all
> sequences of {leader_epoch, end_offset})
>
> bq. also increases the size of OffsetForLeaderEpoch request by at least
> 64bit
>
> Though the size increases, the number of roundtrips is reduced meaningfully
> which would increase the robustness of the solution.
>
> Please expand the reasoning for unclean leader election for this
> alternative.
>
> Thanks
>
> On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:
>
> > Hi,
> >
> >
> > I just created KIP-279 to fix edge cases of log divergence for both clean
> > and unclean leader election configs.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 279%3A+Fix+log+divergence+between+leader+and+follower+
> > after+fast+leader+fail+over
> >
> >
> > The KIP is basically a follow up to KIP-101, and proposes a slight
> > extension to the replication protocol to fix edge cases where logs can
> > diverge due to fast leader fail over.
> >
> >
> > Feedback and suggestions are welcome!
> >
> >
> > Thanks,
> >
> > Anna
> >
>

Re: [DISCUSS] KIP-279: Fix log divergence between leader and follower after fast leader fail over

Posted by Ted Yu <yu...@gmail.com>.
For the second alternative, which was rejected (the follower sends all
sequences of {leader_epoch, end_offset}):

bq. also increases the size of OffsetForLeaderEpoch request by at least
64bit

Though the size increases, the number of roundtrips is reduced meaningfully,
which would increase the robustness of the solution.
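
To make the size comparison concrete, a rough sketch of the two request
shapes (illustrative field names, not the actual Kafka protocol schema); in
the alternative, every extra entry adds at least a 64-bit end offset (plus a
32-bit epoch) on the wire:

import java.util.List;

public class OffsetForLeaderEpochShapes {
    // Proposed: a single recent leader epoch per partition in the request.
    record ProposedPartitionData(int leaderEpoch) {}

    // Rejected alternative: the follower's full sequence of
    // (epoch, end offset) pairs.
    record EpochEndOffset(int leaderEpoch, long endOffset) {}
    record AlternativePartitionData(List<EpochEndOffset> epochLineage) {}

    public static void main(String[] args) {
        AlternativePartitionData alt = new AlternativePartitionData(List.of(
                new EpochEndOffset(1, 80L),
                new EpochEndOffset(2, 95L)));
        System.out.println("alternative request carries " + alt.epochLineage().size() + " entries");
    }
}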

Please expand the reasoning for unclean leader election for this
alternative.

Thanks

On Thu, Apr 5, 2018 at 12:17 PM, Anna Povzner <an...@confluent.io> wrote:

> Hi,
>
>
> I just created KIP-279 to fix edge cases of log divergence for both clean
> and unclean leader election configs.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 279%3A+Fix+log+divergence+between+leader+and+follower+
> after+fast+leader+fail+over
>
>
> The KIP is basically a follow up to KIP-101, and proposes a slight
> extension to the replication protocol to fix edge cases where logs can
> diverge due to fast leader fail over.
>
>
> Feedback and suggestions are welcome!
>
>
> Thanks,
>
> Anna
>