Posted to dev@kafka.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2012/10/17 20:36:03 UTC

[jira] [Commented] (KAFKA-340) Implement clean shutdown in 0.8

    [ https://issues.apache.org/jira/browse/KAFKA-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478202#comment-13478202 ] 

Neha Narkhede commented on KAFKA-340:
-------------------------------------

Thanks for the patch, it looks very well thought out. A few questions/comments -

1. KafkaController
1.1 It seems like we might need to move the filter from the setter to the getter for liveBrokerIds. If the shutting down broker list changes between the time the live broker list is set and the time it is read, the list of live brokers might either miss live brokers or include dead ones.
1.2 Since both the live and shutting down broker lists are sets, you can use set difference instead of individually reading and filtering out unwanted brokers. Not sure how sets are implemented under the covers in Scala, but my guess is that the set difference method is more efficient.
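Roughly what I have in mind for 1.1/1.2 (just a sketch; the names are illustrative, not the ones in the patch):

    // Sketch only: the field names below approximate the controller context
    // discussed above, not necessarily the ones in the patch.
    case class Broker(id: Int)

    class ControllerContext {
      val shuttingDownBrokerIds = scala.collection.mutable.Set.empty[Int]
      private var allBrokers = Set.empty[Broker]

      // The setter keeps the unfiltered list; the filter moves into the getter so
      // that later changes to shuttingDownBrokerIds are always reflected.
      def liveBrokers_=(brokers: Set[Broker]): Unit = { allBrokers = brokers }
      def liveBrokers: Set[Broker] =
        allBrokers.filterNot(b => shuttingDownBrokerIds.contains(b.id))

      // With ids on both sides, the same filtering via set difference.
      def liveBrokerIds: Set[Int] = allBrokers.map(_.id) -- shuttingDownBrokerIds
    }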
1.3 Can we rename replicatedPartitionsBrokerLeads to replicatePartitionsThisBrokerLeads?
1.4 The leader movement sets the isr correctly for the partition by removing both the current leader and any broker that is shutting down from the isr. Then the replica is marked as offline, which tries to remove the broker from the isr again. Let's ignore the fact that the OfflineReplica state change reads the isr from zookeeper, since we are already going to address that in a separate JIRA and I don't think we should optimize too early here. What we can do is add a check to the OfflineReplica state change that skips the zookeeper write and rpc request if the replica is already not in the isr.
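For 1.4, the extra check could look something like this sketch (again, the names here are made up for illustration):

    // Sketch of the proposed short-circuit; the function parameters stand in for
    // whatever the controller uses to read/shrink the isr and notify the leader.
    case class LeaderAndIsr(leader: Int, isr: Set[Int])

    class OfflineReplicaHandler(readIsr: (String, Int) => LeaderAndIsr,
                                shrinkIsr: (String, Int, Int) => Unit,
                                notifyLeader: (String, Int) => Unit) {
      def handleOfflineReplica(topic: String, partition: Int, replicaId: Int): Unit = {
        val current = readIsr(topic, partition)
        if (current.isr.contains(replicaId)) {
          // only do the zookeeper write and the rpc if the replica is still in the isr
          shrinkIsr(topic, partition, replicaId)
          notifyLeader(topic, partition)
        }
        // otherwise the isr is already correct, so skip both the write and the request
      }
    }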
1.5 In shutdownBroker,
      controllerContext.shuttingDownBrokerIds.add(id)
      controllerContext.liveBrokers = controllerContext.liveBrokers.filter(_.id != id)
It seems like the 2nd statement is not required, since the shutting down brokers should be removed from the live brokers list anyway. This is another reason why moving the filter to the getter might be useful and safer.
1.6 If the shutdownBroker API throws a runtime exception, what value is returned to the admin command?
1.7 In shutdownBroker, the controllerLock is acquired and released at least four times. Probably you wanted to release the lock between the moves for each partition so that an unimportant operation like broker shutdown does not block other work, and it is usually wise to release a lock around blocking operations to avoid deadlock/starvation. Still, I didn't get why the lock is released just for sending the stop replica request. It seems like the lock could be held up until the list of partitions to be moved is computed (see the sketch below)? Acquiring and releasing a lock multiple times like this is always tricky; after reading the code once, I'm not sure whether this particular API can run into a race condition. So far, my intuition is that the worst that can happen is that we miss moving some leaders off the broker, which is not a critical operation anyway.
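For 1.7, the pattern I have in mind is roughly the following sketch (names made up, and it only illustrates the lock scope, not the actual controller code):

    // Hold the controller lock once while marking the broker as shutting down and
    // computing the partitions it leads, then do the slower per-partition work
    // outside the lock.
    import java.util.concurrent.locks.ReentrantLock

    class ShutdownCoordinator(controllerLock: ReentrantLock,
                              markShuttingDown: Int => Unit,
                              partitionsLedBy: Int => Seq[String],
                              moveLeader: String => Unit) {
      def shutdownBroker(id: Int): Unit = {
        controllerLock.lock()
        val partitionsToMove =
          try {
            markShuttingDown(id)
            partitionsLedBy(id)
          } finally controllerLock.unlock()

        // leader movement (and the stop replica request) can proceed without the
        // lock; each move could still re-acquire it internally if it needs to
        // touch controller state
        partitionsToMove.foreach(moveLeader)
      }
    }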

2. StopReplicaRequest

2.1 I think it is a little awkward to overload the stop replica request with no partitions to mean that it is not supposed to shut down the replica fetcher thread. In the future, if we find a use case where we need to do this only for some partitions, we might need to change the stop replica request format. What do people think about adding a flag to the stop replica request that tells the broker whether it should just shut down the replica fetcher thread or delete the replica as well?
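For instance, the request could carry an explicit flag; this is only a hypothetical shape to illustrate the idea, not a concrete format proposal:

    // Hypothetical request shape with the suggested flag; field names and layout
    // are illustrative, not the actual 0.8 wire format.
    case class TopicAndPartition(topic: String, partition: Int)

    case class StopReplicaRequest(versionId: Short,
                                  correlationId: Int,
                                  clientId: String,
                                  deletePartitions: Boolean, // false = only stop the fetchers
                                  partitions: Set[TopicAndPartition])

    // e.g. "stop fetching for this partition but keep the replica on disk":
    //   StopReplicaRequest(0, 1, "controller", deletePartitions = false,
    //                      Set(TopicAndPartition("topic1", 0)))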

                
> Implement clean shutdown in 0.8
> -------------------------------
>
>                 Key: KAFKA-340
>                 URL: https://issues.apache.org/jira/browse/KAFKA-340
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8
>            Reporter: Jun Rao
>            Assignee: Joel Koshy
>            Priority: Blocker
>              Labels: bugs
>         Attachments: KAFKA-340-v1.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> If we are shutting down a broker when the ISR of a partition includes only that broker, we could lose some messages that have been previously committed. For clean shutdown, we need to guarantee that there is at least 1 other broker in ISR after the broker is shut down.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira