Posted to jira@kafka.apache.org by "Jose Armando Garcia Sancio (Jira)" <ji...@apache.org> on 2021/07/19 16:54:00 UTC

[jira] [Commented] (KAFKA-13100) Controller cannot revert to an in-memory snapshot

    [ https://issues.apache.org/jira/browse/KAFKA-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383453#comment-17383453 ] 

Jose Armando Garcia Sancio commented on KAFKA-13100:
----------------------------------------------------

Here is an example that triggers this error:

1. Controller 3002 starts as inactive and replays a batch with a last offset of 214
{code:java}
[2021-07-16 16:34:46,950] DEBUG [RaftManager nodeId=3002] Follower high watermark updated to 215 (org.apache.kafka.raft.KafkaRaftClient)
[2021-07-16 16:34:46,951] DEBUG [Controller 3002] Executing handleCommits[baseOffset=214]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:46,951] DEBUG [Controller 3002] Replaying commits from the active node up to offset 214. (org.apache.kafka.controller.QuorumController)
{code}
2. Controller 3002 becomes leader for epoch 3
{code:java}
[2021-07-16 16:34:51,852] INFO [RaftManager nodeId=3002] Completed transition to Leader(localId=3002, epoch=3, epochStartOffset=215, highWatermark=Optional.empty, voterStates={3001=ReplicaState(nodeId=3001, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=false), 3002=ReplicaState(nodeId=3002, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=true), 3003=ReplicaState(nodeId=3003, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=false)}) (org.apache.kafka.raft.QuorumState)
[2021-07-16 16:34:51,852] DEBUG [Controller 3002] Executing handleClaim[3]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:51,852] DEBUG [LeaderEpochCache @metadata-0] Appended new epoch entry EpochEntry(epoch=3, startOffset=215). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2021-07-16 16:34:51,852] WARN [Controller 3002] Becoming active at controller epoch 3. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:51,855] DEBUG [Controller 3002] Processed handleClaim[3] in 2117 us (org.apache.kafka.controller.QuorumController){code}
3. Controller 3002 loses leadership
{code:java}
[2021-07-16 16:34:55,578] DEBUG [Controller 3002] Executing handleRenounce[3]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:55,578] WARN [Controller 3002] Renouncing the leadership at oldEpoch 3 due to a metadata log event. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController){code}
4. Controller 3002 cannot revert to the committed offset because it did not generate an in-memory snapshot at that offset when it transitioned to leader.
{code:java}
[2021-07-16 16:34:55,579] WARN [Controller 3002] org.apache.kafka.controller.QuorumController@646b1289: failed with unknown server exception RuntimeException at epoch -1 in 1510 us. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController){code}
 

An active controller assumes that there is an in-memory snapshot at the committed offset. The inactive controller only generates an in-memory snapshot when it needs to create an on-disk snapshot.

To fix this, the controller needs to generate an in-memory snapshot at the committed offset when it transitions from inactive to active.
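
The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyRegistry` and `claimThenRenounce` are hypothetical names made up for illustration, the state is a plain string, and none of this is the actual `org.apache.kafka.timeline.SnapshotRegistry` or `QuorumController` code. It only shows why reverting to offset 214 fails when the newest in-memory snapshot is at 213, and how taking a snapshot at the committed offset on the inactive-to-active transition avoids that.

```java
import java.util.TreeMap;

/** Toy model of the failure; not Kafka's implementation. */
public class SnapshotSketch {

    /** Hypothetical stand-in for an in-memory snapshot registry keyed by offset. */
    static class ToyRegistry {
        private final TreeMap<Long, String> snapshots = new TreeMap<>();
        private String state = "";

        /** Applies a record to the in-memory state. */
        void apply(String record) {
            state = state + record;
        }

        String state() {
            return state;
        }

        /** Remembers the current state under the given offset. */
        void createSnapshot(long offset) {
            snapshots.put(offset, state);
        }

        /** Restores the state saved at the offset, or fails if none was saved. */
        void revertToSnapshot(long offset) {
            String saved = snapshots.get(offset);
            if (saved == null) {
                throw new RuntimeException("No snapshot for epoch " + offset
                    + ". Snapshot epochs are: " + snapshots.keySet());
            }
            state = saved;
            // Snapshots newer than the revert point are no longer valid.
            snapshots.tailMap(offset, false).clear();
        }
    }

    /** Simulates replay as inactive, claim, an uncommitted write, then renounce. */
    static String claimThenRenounce(boolean snapshotOnClaim) {
        ToyRegistry registry = new ToyRegistry();
        registry.apply("a");
        registry.createSnapshot(213L);      // snapshot taken for an on-disk snapshot
        registry.apply("b");                // batch ending at offset 214, replayed while inactive
        long lastCommittedOffset = 214L;

        if (snapshotOnClaim) {
            // Proposed fix: snapshot the committed offset when becoming active.
            registry.createSnapshot(lastCommittedOffset);
        }

        registry.apply("c");                // uncommitted change made while active
        registry.revertToSnapshot(lastCommittedOffset); // leadership lost
        return registry.state();
    }

    public static void main(String[] args) {
        System.out.println("with snapshot on claim: " + claimThenRenounce(true));
        try {
            claimThenRenounce(false);
        } catch (RuntimeException e) {
            System.out.println("without snapshot on claim: " + e.getMessage());
        }
    }
}
```

With `snapshotOnClaim` set, the uncommitted change is discarded and the state returns to what was committed at offset 214. Without it, the revert throws because the registry only knows about offset 213.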

> Controller cannot revert to an in-memory snapshot
> -------------------------------------------------
>
>                 Key: KAFKA-13100
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13100
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft
>            Reporter: Jose Armando Garcia Sancio
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Blocker
>              Labels: kip-500
>             Fix For: 3.0.0
>
>
> {code:java}
>   [2021-07-16 16:34:55,578] DEBUG [Controller 3002] Executing handleRenounce[3]. (org.apache.kafka.controller.QuorumController)
>   [2021-07-16 16:34:55,578] WARN [Controller 3002] Renouncing the leadership at oldEpoch 3 due to a metadata log event. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController)
>   [2021-07-16 16:34:55,579] WARN [Controller 3002] org.apache.kafka.controller.QuorumController@646b1289: failed with unknown server exception RuntimeException at epoch -1 in 1510 us. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController)
>   java.lang.RuntimeException: No snapshot for epoch 214. Snapshot epochs are: -1, 1, 3, 5, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 94, 96, 97, 107, 108, 112, 125, 126, 128, 135, 171, 208, 213
>           at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)
>           at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:203)
>           at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:784)
>           at org.apache.kafka.controller.QuorumController.access$2500(QuorumController.java:121)
>           at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleLeaderChange$3(QuorumController.java:769)
>           at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:311)
>           at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
>           at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>           at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>           at java.lang.Thread.run(Thread.java:748)
>   [2021-07-16 16:34:55,580] ERROR [Controller 3002] Unexpected exception in handleException (org.apache.kafka.queue.KafkaEventQueue)
>   java.lang.RuntimeException: No snapshot for epoch 214. Snapshot epochs are: -1, 1, 3, 5, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 94, 96, 97, 107, 108, 112, 125, 126, 128, 135, 171, 208, 213
>           at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)
>           at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:203)
>           at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:784)
>           at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:287)
>           at org.apache.kafka.controller.QuorumController.access$500(QuorumController.java:121)
>           at org.apache.kafka.controller.QuorumController$ControlEvent.handleException(QuorumController.java:317)
>           at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:126)
>           at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>           at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>           at java.lang.Thread.run(Thread.java:748) {code}


