Posted to jira@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/16 20:08:00 UTC

[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

    [ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439970#comment-16439970 ] 

ASF GitHub Bot commented on KAFKA-6361:
---------------------------------------

apovzner opened a new pull request #4882:  KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over
URL: https://github.com/apache/kafka/pull/4882
 
 
   WIP - will add a few more unit tests.
   
   Implementation of KIP-279 as described here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over
   
   In summary:
   - Added leader_epoch to OFFSET_FOR_LEADER_EPOCH_RESPONSE
   - Leader replies with the pair (largest epoch less than or equal to the requested epoch, end offset of that epoch)
   - If the follower does not know about the leader epoch in the reply, it truncates to the end offset of its largest leader epoch smaller than the epoch the leader replied with, and sends another OffsetForLeaderEpoch request for that smaller epoch (see the sketch below).
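   
   A rough, self-contained sketch of the truncation loop described above. The object, method, and map names are illustrative only, not the actual Kafka code, and it simply computes the final truncation offset rather than truncating at every step:
   
   ```scala
   // Hypothetical model of the KIP-279 follower logic, using the scenario from the JIRA below.
   object Kip279Sketch {
     // epoch -> end offset of that epoch
     val leaderEpochs   = Map(1 -> 21L, 3 -> 31L)   // broker A: leader in epoch 3
     val followerEpochs = Map(1 -> 11L, 2 -> 16L)   // broker B: wrote [11, 15] in epoch 2
   
     // Leader answers with the largest epoch <= the requested epoch plus that epoch's end offset.
     def leaderReply(requestedEpoch: Int): (Int, Long) = {
       val epoch = leaderEpochs.keys.filter(_ <= requestedEpoch).max
       (epoch, leaderEpochs(epoch))
     }
   
     // Follower keeps asking until the leader names an epoch the follower also has, then
     // truncates to min(leader's end offset, follower's end offset for that epoch).
     def truncationOffset(requestedEpoch: Int): Long = {
       val (epoch, endOffset) = leaderReply(requestedEpoch)
       if (followerEpochs.contains(epoch)) math.min(endOffset, followerEpochs(epoch))
       else truncationOffset(followerEpochs.keys.filter(_ < epoch).max)
     }
   
     def main(args: Array[String]): Unit =
       println(truncationOffset(followerEpochs.keys.max))   // prints 11
   }
   ```
   
   With the logs from the JIRA description below, this yields offset 11, so broker B truncates its epoch-2 data before re-fetching from broker A.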
   
   Added integration test EpochDrivenReplicationProtocolAcceptanceTest.logsShouldNotDivergeOnUncleanLeaderElections, which performs three fast leader changes with unclean leader election enabled and min ISR of 1. The test failed before the fix was implemented.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Fast leader fail over can lead to log divergence between leader and follower
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-6361
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6361
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Assignee: Anna Povzner
>            Priority: Major
>              Labels: reliability
>
> We have observed an edge case in the replication failover logic which can cause a replica to permanently fall out of sync with the leader or, in the worst case, actually have localized divergence between logs. This occurs in spite of the improved truncation logic from KIP-101. 
> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It appends two batches: one in the range (0, 10) and the other in the range (11, 20). The first one successfully replicates to B, but the second one does not. In other words, the logs on the brokers look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a zk session expiration and broker B is elected with epoch 2. It appends a new batch with offsets (11, n) to its local log. So we now have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the follower, but before it is able to do so, broker B has its own zk session expiration and broker A again becomes leader, now with epoch 3. It then appends a new entry in the range (21, 30). The updated logs look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets: [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch request to broker A with epoch 2. Broker A will respond that epoch 2 ends at offset 21. There are three cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin fetching from offset n, which will ultimately cause an out-of-order offset error because broker A will return the full batch beginning from offset 11, which broker B will be unable to append.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21 and everything will appear fine though the logs have actually diverged.
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in the middle of the batch, it will truncate all the way to offset 10. It can begin fetching from offset 11 and everything is fine.
> The case we have actually seen is the first one. The second one would likely go unnoticed in practice, and everything is fine in the third case. To work around the issue, we deleted the active segment on the replica, which allowed it to re-replicate consistently from the leader.
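> Below is a small, hypothetical Scala sketch of the existing (KIP-101) follower behaviour that produces the three cases above; the offsets are hard-coded to this example and the code is not the actual Kafka implementation:
> {code}
> object PreKip279Sketch {
>   // Broker B's log: [0, 10] in epoch 1 and [11, n] in epoch 2; broker A answers that
>   // epoch 2 "ends" at offset 21 (the start offset of the next epoch it knows, epoch 3).
>   def followerOutcome(n: Long): String = {
>     val leaderEndOffset = 21L              // offset returned by broker A for epoch 2
>     val followerLogEnd  = n + 1            // broker B's log end offset
>     val truncateTo      = math.min(leaderEndOffset, followerLogEnd)
>     if (truncateTo >= followerLogEnd) {
>       if (n < 20) "no truncation; fetch from log end -> out-of-order offset error (leader resends the batch starting at 11)"
>       else "no truncation; fetch from 21 -> logs silently diverge on [11, 20]"
>     } else {
>       // offset 21 falls inside batch [11, n], so the whole batch is removed
>       "truncate to batch boundary; refetch from 11 -> logs converge"
>     }
>   }
>   def main(args: Array[String]): Unit =
>     Seq(15L, 20L, 30L).foreach(n => println(s"n=$n: ${followerOutcome(n)}"))
> }
> {code}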
> I'm not sure of the best solution for this scenario. Maybe if the leader isn't aware of an epoch, it should always respond with {{UNDEFINED_EPOCH_OFFSET}} instead of using the offset of the next highest epoch. That would cause the follower to truncate using its high watermark. Or perhaps, instead of doing so, it could send another OffsetForLeaderEpoch request for the next-lowest epoch in its cache and then truncate using that.
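> For the first alternative, a hypothetical leader-side check might look like the following (the cache map and names are illustrative only):
> {code}
> // Hypothetical: answer UNDEFINED_EPOCH_OFFSET when the requested epoch is not in the
> // leader's cache, instead of substituting the end offset of the next higher epoch.
> val UNDEFINED_EPOCH_OFFSET = -1L
> def endOffsetFor(requestedEpoch: Int, leaderEpochCache: Map[Int, Long]): Long =
>   leaderEpochCache.getOrElse(requestedEpoch, UNDEFINED_EPOCH_OFFSET)
> // With broker A's cache Map(1 -> 21L, 3 -> 31L), a request for epoch 2 now returns
> // UNDEFINED_EPOCH_OFFSET, and broker B falls back to truncating to its high watermark.
> {code}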



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)