Posted to jira@kafka.apache.org by "Jose Armando Garcia Sancio (Jira)" <ji...@apache.org> on 2021/07/19 16:54:00 UTC
[jira] [Commented] (KAFKA-13100) Controller cannot revert to an in-memory snapshot
[ https://issues.apache.org/jira/browse/KAFKA-13100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383453#comment-17383453 ]
Jose Armando Garcia Sancio commented on KAFKA-13100:
----------------------------------------------------
Here is an example that triggers this error:
1. Controller 3002 starts as inactive and replays a batch with a last offset of 214
{code:java}
[2021-07-16 16:34:46,950] DEBUG [RaftManager nodeId=3002] Follower high watermark updated to 215 (org.apache.kafka.raft.KafkaRaftClient)
[2021-07-16 16:34:46,951] DEBUG [Controller 3002] Executing handleCommits[baseOffset=214]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:46,951] DEBUG [Controller 3002] Replaying commits from the active node up to offset 214. (org.apache.kafka.controller.QuorumController)
{code}
2. Controller 3002 becomes leader for epoch 3
{code:java}
[2021-07-16 16:34:51,852] INFO [RaftManager nodeId=3002] Completed transition to Leader(localId=3002, epoch=3, epochStartOffset=215, highWatermark=Optional.empty, voterStates={3001=ReplicaState(nodeId=3001, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=false), 3002=ReplicaState(nodeId=3002, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=true), 3003=ReplicaState(nodeId=3003, endOffset=Optional.empty, lastFetchTimestamp=OptionalLong.empty, hasAcknowledgedLeader=false)}) (org.apache.kafka.raft.QuorumState)
[2021-07-16 16:34:51,852] DEBUG [Controller 3002] Executing handleClaim[3]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:51,852] DEBUG [LeaderEpochCache @metadata-0] Appended new epoch entry EpochEntry(epoch=3, startOffset=215). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2021-07-16 16:34:51,852] WARN [Controller 3002] Becoming active at controller epoch 3. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:51,855] DEBUG [Controller 3002] Processed handleClaim[3] in 2117 us (org.apache.kafka.controller.QuorumController){code}
3. Controller 3002 loses leadership
{code:java}
[2021-07-16 16:34:55,578] DEBUG [Controller 3002] Executing handleRenounce[3]. (org.apache.kafka.controller.QuorumController)
[2021-07-16 16:34:55,578] WARN [Controller 3002] Renouncing the leadership at oldEpoch 3 due to a metadata log event. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController){code}
4. Controller 3002 cannot revert to the committed offset because it did not generate an in-memory snapshot at that offset when it transitioned to leader.
{code:java}
[2021-07-16 16:34:55,579] WARN [Controller 3002] org.apache.kafka.controller.QuorumController@646b1289: failed with unknown server exception RuntimeException at epoch -1 in 1510 us. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController){code}
An active controller assumes that there is an in-memory snapshot at the committed offset. An inactive controller, however, only generates an in-memory snapshot when it needs to create an on-disk snapshot.
To fix this, the controller needs to generate an in-memory snapshot at the committed offset when it transitions from inactive to active.
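The failure and the proposed fix can be illustrated with a small, self-contained sketch. This is not the QuorumController or SnapshotRegistry code: ToySnapshotRegistry, ToyController, the snapshotOnClaim flag, and the integer "state" are all hypothetical names invented here for illustration. It only assumes what is described above, namely that renouncing leadership reverts to a snapshot at the last committed offset and fails if no such snapshot exists.

```java
import java.util.TreeMap;

// Hypothetical, simplified stand-in for a snapshot registry. Not Kafka's API.
class ToySnapshotRegistry {
    private final TreeMap<Long, Integer> snapshots = new TreeMap<>();
    private int state = 0; // toy "metadata state": a single counter

    void apply(int delta) { state += delta; }
    int state() { return state; }

    void createSnapshot(long offset) { snapshots.put(offset, state); }
    boolean hasSnapshot(long offset) { return snapshots.containsKey(offset); }

    void revertToSnapshot(long offset) {
        Integer saved = snapshots.get(offset);
        if (saved == null) {
            throw new RuntimeException("No snapshot for epoch " + offset
                + ". Snapshot epochs are: " + snapshots.keySet());
        }
        state = saved;
        // Drop any snapshots taken after the one we reverted to.
        snapshots.tailMap(offset, false).clear();
    }
}

// Hypothetical controller that replays commits while inactive and
// reverts uncommitted changes when it renounces leadership.
class ToyController {
    final ToySnapshotRegistry registry = new ToySnapshotRegistry();
    final boolean snapshotOnClaim; // true = apply the proposed fix
    long lastCommittedOffset = -1;
    boolean active = false;

    ToyController(boolean snapshotOnClaim) {
        this.snapshotOnClaim = snapshotOnClaim;
        registry.createSnapshot(-1); // initial empty snapshot
    }

    // Inactive path: replay a committed batch without taking a snapshot.
    void replayCommitted(long lastOffset, int delta) {
        registry.apply(delta);
        lastCommittedOffset = lastOffset;
    }

    // Transition from inactive to active.
    void claim() {
        if (snapshotOnClaim && !registry.hasSnapshot(lastCommittedOffset)) {
            // Proposed fix: snapshot the committed state before any
            // uncommitted changes are applied as leader.
            registry.createSnapshot(lastCommittedOffset);
        }
        active = true;
    }

    // Active path: apply a change that has not been committed yet.
    void writeUncommitted(int delta) { registry.apply(delta); }

    // Losing leadership: throw away uncommitted changes.
    void renounce() {
        registry.revertToSnapshot(lastCommittedOffset);
        active = false;
    }
}

class RenounceSketch {
    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            ToyController c = new ToyController(fix);
            c.replayCommitted(214, 10);
            c.claim();
            c.writeUncommitted(5);
            try {
                c.renounce();
                System.out.println("fix=" + fix + " reverted, state=" + c.registry.state());
            } catch (RuntimeException e) {
                System.out.println("fix=" + fix + " failed: " + e.getMessage());
            }
        }
    }
}
```

Without the snapshot taken in claim(), the sketch fails in renounce() in the same way as the log above. With it, the uncommitted change is discarded and the state at offset 214 is restored.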
> Controller cannot revert to an in-memory snapshot
> -------------------------------------------------
>
> Key: KAFKA-13100
> URL: https://issues.apache.org/jira/browse/KAFKA-13100
> Project: Kafka
> Issue Type: Bug
> Components: controller, kraft
> Reporter: Jose Armando Garcia Sancio
> Assignee: Jose Armando Garcia Sancio
> Priority: Blocker
> Labels: kip-500
> Fix For: 3.0.0
>
>
> {code:java}
> [2021-07-16 16:34:55,578] DEBUG [Controller 3002] Executing handleRenounce[3]. (org.apache.kafka.controller.QuorumController)
> [2021-07-16 16:34:55,578] WARN [Controller 3002] Renouncing the leadership at oldEpoch 3 due to a metadata log event. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController)
> [2021-07-16 16:34:55,579] WARN [Controller 3002] org.apache.kafka.controller.QuorumController@646b1289: failed with unknown server exception RuntimeException at epoch -1 in 1510 us. Reverting to last committed offset 214. (org.apache.kafka.controller.QuorumController)
> java.lang.RuntimeException: No snapshot for epoch 214. Snapshot epochs are: -1, 1, 3, 5, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 94, 96, 97, 107, 108, 112, 125, 126, 128, 135, 171, 208, 213
> at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)
> at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:203)
> at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:784)
> at org.apache.kafka.controller.QuorumController.access$2500(QuorumController.java:121)
> at org.apache.kafka.controller.QuorumController$QuorumMetaLogListener.lambda$handleLeaderChange$3(QuorumController.java:769)
> at org.apache.kafka.controller.QuorumController$ControlEvent.run(QuorumController.java:311)
> at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
> at java.lang.Thread.run(Thread.java:748)
> [2021-07-16 16:34:55,580] ERROR [Controller 3002] Unexpected exception in handleException (org.apache.kafka.queue.KafkaEventQueue)
> java.lang.RuntimeException: No snapshot for epoch 214. Snapshot epochs are: -1, 1, 3, 5, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61, 94, 96, 97, 107, 108, 112, 125, 126, 128, 135, 171, 208, 213
> at org.apache.kafka.timeline.SnapshotRegistry.getSnapshot(SnapshotRegistry.java:173)
> at org.apache.kafka.timeline.SnapshotRegistry.revertToSnapshot(SnapshotRegistry.java:203)
> at org.apache.kafka.controller.QuorumController.renounce(QuorumController.java:784)
> at org.apache.kafka.controller.QuorumController.handleEventException(QuorumController.java:287)
> at org.apache.kafka.controller.QuorumController.access$500(QuorumController.java:121)
> at org.apache.kafka.controller.QuorumController$ControlEvent.handleException(QuorumController.java:317)
> at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:126)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
> at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
> at java.lang.Thread.run(Thread.java:748) {code}