Posted to dev@kafka.apache.org by "Onur Karaman (JIRA)" <ji...@apache.org> on 2016/11/15 08:51:59 UTC

[jira] [Commented] (KAFKA-4410) KafkaController sends double the expected number of StopReplicaRequests during controlled shutdown

    [ https://issues.apache.org/jira/browse/KAFKA-4410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15666555#comment-15666555 ] 

Onur Karaman commented on KAFKA-4410:
-------------------------------------

To reproduce the bug, start ZooKeeper and two Kafka brokers:
{code}
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
{code}
Create a topic with 100 partitions and a replication factor of 2. This should give each broker 50 leader replicas and 50 follower replicas:
{code}
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --partitions 100 --replication-factor 2
Created topic "t".
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --describe | grep -o "Leader: [0-9]" | sort | uniq -c
  50 Leader: 0
  50 Leader: 1
{code}
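The expected number of StopReplicaRequests is the number of follower replicas on the broker being shut down, so it can help to count followers directly instead of inferring them from the leader counts. A minimal sketch, assuming the partition lines of the describe output carry "Leader: N" and "Replicas: a,b" fields separated by whitespace; count_followers is a hypothetical helper name, not part of Kafka:

```shell
# Hypothetical helper: count follower replicas per broker from
# `kafka-topics.sh --describe` output read on stdin.
# Assumes partition lines look like "... Leader: 1  Replicas: 1,0  Isr: 1,0".
count_followers() {
  awk '/Leader: / && /Replicas: / {
         leader = ""; replicas = ""
         for (i = 1; i < NF; i++) {
           if ($i == "Leader:")   leader = $(i + 1)
           if ($i == "Replicas:") replicas = $(i + 1)
         }
         # Every replica that is not the leader is a follower.
         n = split(replicas, r, ",")
         for (j = 1; j <= n; j++)
           if (r[j] != leader) followers[r[j]]++
       }
       END { for (b in followers) printf "broker %s: %d followers\n", b, followers[b] }' \
  | sort
}
```

It would be used as ./bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic t | count_followers, and in this setup each broker should report 50 followers.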
Perform a controlled shutdown of one broker (I chose the non-controller, broker 1). The request log on broker 1 shows 99 StopReplicaRequests (api_key=5), almost exactly double the 50 follower replicas on broker 1.
{code}
> grep "api_key=5" logs1/kafka-request.log | wc -l
      99
{code}
The one replica that was not doubled (partition 75) had its duplicate request fail to go out because broker 1 had already begun to disconnect from the controller.
{code}
> grep "api_key=5" logs1/kafka-request.log | grep -Eo "partition=[0-9]+" | sort | uniq -c
   2 partition=1
   2 partition=11
   2 partition=13
   2 partition=15
   2 partition=17
   2 partition=19
   2 partition=21
   2 partition=23
   2 partition=25
   2 partition=27
   2 partition=29
   2 partition=3
   2 partition=31
   2 partition=33
   2 partition=35
   2 partition=37
   2 partition=39
   2 partition=41
   2 partition=43
   2 partition=45
   2 partition=47
   2 partition=49
   2 partition=5
   2 partition=51
   2 partition=53
   2 partition=55
   2 partition=57
   2 partition=59
   2 partition=61
   2 partition=63
   2 partition=65
   2 partition=67
   2 partition=69
   2 partition=7
   2 partition=71
   2 partition=73
   1 partition=75
   2 partition=77
   2 partition=79
   2 partition=81
   2 partition=83
   2 partition=85
   2 partition=87
   2 partition=89
   2 partition=9
   2 partition=91
   2 partition=93
   2 partition=95
   2 partition=97
   2 partition=99

> grep "fails to send request" logs0/controller.log
[2016-11-15 00:29:42,930] WARN [Controller-0-to-broker-1-send-thread], Controller 0 epoch 1 fails to send request {controller_id=0,controller_epoch=1,delete_partitions=false,partitions=[{topic=t,partition=75}]} to broker localhost:9091 (id: 1 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread)
{code}
Counting the failed StopReplicaRequest, the controller attempted 99 + 1 = 100 StopReplicaRequests, or 2x the 50 expected for broker 1's follower replicas.
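
The tally can also be checked in one pass rather than by reading the per-partition list. A rough sketch, assuming each StopReplicaRequest line in the request log contains "api_key=5" and a single "partition=N" field, as in the grep output earlier in this comment; summarize_stop_replica is a hypothetical helper name:

```shell
# Hypothetical helper: summarise StopReplicaRequests (api_key=5) in a request log.
# Assumes each matching line carries a single "partition=N" field.
summarize_stop_replica() {
  grep "api_key=5" "$1" \
    | grep -Eo "partition=[0-9]+" \
    | sort | uniq -c \
    | awk '{ total += $1
             # Remember every partition that was not seen exactly twice.
             if ($1 != 2) odd = odd (odd == "" ? "" : ",") $2 }
           END { printf "total=%d partitions=%d not_doubled=%s\n",
                        total, NR, (odd == "" ? "none" : odd) }'
}
```

Run as summarize_stop_replica logs1/kafka-request.log. This only counts requests that reached broker 1, so the request the controller failed to send still has to be added by hand from logs0/controller.log.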

> KafkaController sends double the expected number of StopReplicaRequests during controlled shutdown
> --------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4410
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4410
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>
> We expect KafkaController to send one StopReplicaRequest for each follower replica on the broker undergoing controlled shutdown. Examining KafkaController.shutdownBroker, we see that this is not the case:
> 1. KafkaController.shutdownBroker itself sends a StopReplicaRequest for each follower replica
> 2. KafkaController.shutdownBroker transitions every follower replica to OfflineReplica in its call to replicaStateMachine.handleStateChanges, which also sends a StopReplicaRequest.
