Posted to jira@kafka.apache.org by "Luke Chen (Jira)" <ji...@apache.org> on 2022/09/02 09:07:00 UTC

[jira] [Comment Edited] (KAFKA-14197) Kraft broker fails to startup after topic creation failure

    [ https://issues.apache.org/jira/browse/KAFKA-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599345#comment-17599345 ] 

Luke Chen edited comment on KAFKA-14197 at 9/2/22 9:06 AM:
-----------------------------------------------------------

I think we should have a way to notify ReplicationControlManager that the topic was not created successfully, so that it won't send out the partitionChangeRecords during controlled shutdown. But I don't have a good idea of how to achieve that gracefully.

 

cc [~hachikuji]  [~cmccabe] [~jsancio] [~dengziming] 


was (Author: showuon):
I think we should have a way to notify ReplicationControlManager that the topic was not created successfully. But I don't have a good idea of how to achieve that gracefully. cc [~hachikuji]  [~cmccabe] [~jsancio] [~dengziming] 

> Kraft broker fails to startup after topic creation failure
> ----------------------------------------------------------
>
>                 Key: KAFKA-14197
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14197
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>            Reporter: Luke Chen
>            Priority: Major
>
> In the KRaft ControllerWriteEvent, we first try to apply the records to the controller's in-memory state and then send them out via the raft client. But if an error occurs while sending the records, there is no way to revert the change to the controller's in-memory state[1].
> The issue happened when creating topics: the controller state is updated with the topic and partition metadata (e.g. the broker-to-ISR map), but the records are not sent out successfully (e.g. because of a buffer allocation error). Then, when the node shuts down, the controlled shutdown tries to remove the broker from the ISR by[2]:
> {code:java}
> generateLeaderAndIsrUpdates("enterControlledShutdown[" + brokerId + "]", brokerId, NO_LEADER, records, brokersToIsrs.partitionsWithBrokerInIsr(brokerId));{code}
>  
> After the partitionChangeRecords are appended and sent to the metadata topic successfully, the brokers fail to "replay" these partition changes, because the topics/partitions were never created successfully.
> Even worse, after the node restarts, all the metadata records are replayed again and the same error occurs again, so the broker cannot start up successfully.
>  
> The error and call stack look like this. Basically, it complains that the topic image can't be found:
> {code:java}
> [2022-09-02 16:29:16,334] ERROR Encountered metadata loading fault: Error replaying metadata log record at offset 81 (org.apache.kafka.server.fault.LoggingFaultHandler)
> java.lang.NullPointerException
>     at org.apache.kafka.image.TopicDelta.replay(TopicDelta.java:69)
>     at org.apache.kafka.image.TopicsDelta.replay(TopicsDelta.java:91)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:248)
>     at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:186)
>     at kafka.server.metadata.BrokerMetadataListener.$anonfun$loadBatches$3(BrokerMetadataListener.scala:239)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>     at kafka.server.metadata.BrokerMetadataListener.kafka$server$metadata$BrokerMetadataListener$$loadBatches(BrokerMetadataListener.scala:232)
>     at kafka.server.metadata.BrokerMetadataListener$HandleCommitsEvent.run(BrokerMetadataListener.scala:113)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:121)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:200)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:173)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
>  
> [1] [https://github.com/apache/kafka/blob/ef65b6e566ef69b2f9b58038c98a5993563d7a68/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L779-L804] 
> [2] [https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1270]
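
To make the failure mode in the description concrete, here is a minimal, hypothetical sketch of the "apply to memory first, then append" ordering. It is not the QuorumController code: ControllerWriteSketch, failNextAppend, createPartitionUnsafe and createPartitionWithRollback are names invented for this illustration, and the snapshot-and-restore rollback is only one possible idea, not a proposed fix. Only the name partitionsWithBrokerInIsr is borrowed from the snippet quoted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the problem; none of this is a Kafka API.
class ControllerWriteSketch {
    // Stand-in for the controller's in-memory broker-to-ISR map.
    final Map<Integer, Set<String>> brokerToIsrPartitions = new HashMap<>();
    // Stand-in for the metadata log written through the raft client.
    final List<String> metadataLog = new ArrayList<>();
    // When true, the next append fails (imitating e.g. a buffer allocation error).
    boolean failNextAppend = false;

    void append(List<String> records) {
        if (failNextAppend) {
            failNextAppend = false;
            throw new IllegalStateException("simulated buffer allocation error");
        }
        metadataLog.addAll(records);
    }

    void applyToMemory(String partition, int brokerId) {
        brokerToIsrPartitions.computeIfAbsent(brokerId, id -> new HashSet<>()).add(partition);
    }

    // Apply first, then append, with no rollback: the ordering described in the report.
    void createPartitionUnsafe(String partition, int brokerId) {
        applyToMemory(partition, brokerId);
        append(List.of("PartitionRecord(" + partition + ")"));
    }

    // Same ordering, but the in-memory state is restored from a copy if the append throws.
    void createPartitionWithRollback(String partition, int brokerId) {
        Map<Integer, Set<String>> snapshot = new HashMap<>();
        brokerToIsrPartitions.forEach((id, parts) -> snapshot.put(id, new HashSet<>(parts)));
        try {
            applyToMemory(partition, brokerId);
            append(List.of("PartitionRecord(" + partition + ")"));
        } catch (RuntimeException e) {
            brokerToIsrPartitions.clear();
            brokerToIsrPartitions.putAll(snapshot);
            throw e;
        }
    }

    // Partitions for which a controlled shutdown would emit change records.
    Set<String> partitionsWithBrokerInIsr(int brokerId) {
        return brokerToIsrPartitions.getOrDefault(brokerId, Set.of());
    }

    public static void main(String[] args) {
        ControllerWriteSketch unsafe = new ControllerWriteSketch();
        unsafe.failNextAppend = true;
        try {
            unsafe.createPartitionUnsafe("foo-0", 1);
        } catch (IllegalStateException e) {
            // The record never reached the log, yet memory still lists the partition.
        }
        System.out.println(unsafe.partitionsWithBrokerInIsr(1)); // prints [foo-0]

        ControllerWriteSketch safe = new ControllerWriteSketch();
        safe.failNextAppend = true;
        try {
            safe.createPartitionWithRollback("foo-0", 1);
        } catch (IllegalStateException e) {
            // Memory was restored, so shutdown has nothing to emit for this partition.
        }
        System.out.println(safe.partitionsWithBrokerInIsr(1)); // prints []
    }
}
```

In the unsafe variant, the partition that exists only in memory is exactly what a controlled shutdown would later generate a change record for, even though no record creating it is in the log.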



--
This message was sent by Atlassian Jira
(v8.20.10#820010)