Posted to jira@kafka.apache.org by "Boyang Chen (Jira)" <ji...@apache.org> on 2020/03/03 19:21:00 UTC

[jira] [Commented] (KAFKA-9118) LogDirFailureHandler shouldn't use Zookeeper

    [ https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050496#comment-17050496 ] 

Boyang Chen commented on KAFKA-9118:
------------------------------------

[~viktorsomogyi] Thanks for the info. Do you already have a work-in-progress KIP for the logDir API migration?

> LogDirFailureHandler shouldn't use Zookeeper
> --------------------------------------------
>
>                 Key: KAFKA-9118
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9118
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Viktor Somogyi-Vass
>            Assignee: Viktor Somogyi-Vass
>            Priority: Major
>
> As described in [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker acts as if it had received a StopReplicaRequest for this replica. More specifically, the replica is no longer considered leader and is removed from any replica fetcher thread. (The clients will receive an UnknownTopicOrPartitionException at this point.)
> - The broker notifies the controller by creating a sequential znode under path /log_dir_event_notification with data of the format {"version" : 1, "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state of all topic partitions on the broker. The LeaderAndIsrResponse from this broker will specify KafkaStorageException for those partitions that are on the bad log directories.
> - The controller updates the information of offline replicas in memory and triggers leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification znode, we should use a Kafka protocol message that notifies the controller of the offline partitions. The controller then updates the information of offline replicas in memory, triggers leader election, removes the replicas from the ISR in ZK, and sends a LeaderAndIsrRequest and an UpdateMetadataRequest.
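
For illustration only, the proposed flow could be sketched roughly as below. This is a minimal in-memory model, not Kafka code: LogDirFailureNotification, ControllerSketch, the field names, and the "first remaining ISR member becomes leader" rule are all assumptions made up for this sketch, and the two requests are only recorded as strings rather than sent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical message a broker would send to the controller instead of
// creating a znode under /log_dir_event_notification.
final class LogDirFailureNotification {
    final int brokerId;
    final Set<String> offlinePartitions; // e.g. "orders-0"

    LogDirFailureNotification(int brokerId, Set<String> offlinePartitions) {
        this.brokerId = brokerId;
        this.offlinePartitions = offlinePartitions;
    }
}

// Hypothetical controller-side handler; the maps stand in for controller
// memory and for the ISR state that is persisted in ZK.
final class ControllerSketch {
    final Map<String, List<Integer>> isr = new HashMap<>();
    final Map<String, Integer> leader = new HashMap<>();
    final Map<Integer, Set<String>> offlineReplicas = new HashMap<>();
    final List<String> sentRequests = new ArrayList<>();

    void handle(LogDirFailureNotification n) {
        // 1. Update the offline replica information in memory.
        offlineReplicas.computeIfAbsent(n.brokerId, k -> new TreeSet<>())
                       .addAll(n.offlinePartitions);

        for (String partition : n.offlinePartitions) {
            List<Integer> replicas = isr.get(partition);
            if (replicas == null) {
                continue; // partition unknown to this sketch
            }
            // 2. Trigger leader election if the failed replica was the leader.
            //    Assumption: pick the first other ISR member, or -1 if none.
            if (Integer.valueOf(n.brokerId).equals(leader.get(partition))) {
                int newLeader = -1;
                for (int candidate : replicas) {
                    if (candidate != n.brokerId) {
                        newLeader = candidate;
                        break;
                    }
                }
                leader.put(partition, newLeader);
            }
            // 3. Remove the offline replica from the ISR.
            replicas.remove(Integer.valueOf(n.brokerId));
        }

        // 4. Record the requests the controller would send next.
        sentRequests.add("LeaderAndIsrRequest");
        sentRequests.add("UpdateMetadataRequest");
    }
}

final class LogDirFailureSketch {
    public static void main(String[] args) {
        ControllerSketch controller = new ControllerSketch();
        controller.isr.put("orders-0", new ArrayList<>(Arrays.asList(1, 2, 3)));
        controller.leader.put("orders-0", 1);

        // Broker 1 reports that its replica of orders-0 went offline.
        controller.handle(
            new LogDirFailureNotification(1, Collections.singleton("orders-0")));

        System.out.println(controller.leader.get("orders-0")
            + " " + controller.isr.get("orders-0")); // prints "2 [2, 3]"
    }
}
```

Compared with the znode-based steps, the controller in this sketch learns the affected partitions directly from the notification, so it does not need the extra LeaderAndIsrRequest round trip to query the broker's state.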


