Posted to dev@kafka.apache.org by "Joe Ammann (JIRA)" <ji...@apache.org> on 2019/03/23 09:31:00 UTC

[jira] [Created] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

Joe Ammann created KAFKA-8151:
---------------------------------

             Summary: Broker hangs and lockups after Zookeeper outages
                 Key: KAFKA-8151
                 URL: https://issues.apache.org/jira/browse/KAFKA-8151
             Project: Kafka
          Issue Type: Bug
          Components: controller, core, zkclient
    Affects Versions: 2.1.1
            Reporter: Joe Ammann


We're running several clusters (mostly with 3 brokers) on 2.1.1, where we see at least 3 different symptoms, all resulting in broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms is temporary outages (normally 3-5 minutes) of the Zookeeper cluster. The Linux VMs on which the ZK nodes run regularly get stalled for a couple of minutes. The ZK nodes always reunite very quickly and build a quorum after the situation clears, but the Kafka brokers (which run on the same Linux VMs) quite often show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in quotes: I can never predict what will happen):

# the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some reason (that's the problem I originally described)
# the brokers all re-register and elect a new controller. But that new controller does not fully work. For example, it doesn't process partition reassignment requests and/or does not transfer partition leadership after I kill a broker
# the previous controller gets "dead-locked" (it has 3-4 of the important controller threads stuck on a lock) and hence does not perform any of its controller duties. But it still regards itself as the valid controller and is accepted as such by the other brokers
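
For illustration only, here is a minimal sketch of how the first symptom could be detected from ZK state. It assumes that the children of /brokers/ids are the broker ids as strings and that the /controller znode holds a JSON payload with a "brokerid" field; the function names and the way the data is fetched from ZK (for example with a ZK client or the zookeeper-shell tool) are hypothetical and left out. It cannot detect symptoms 2 and 3, where the ZK state looks healthy while the controller is not actually working.

```python
import json


def missing_brokers(expected_ids, registered_children):
    """Return expected broker ids with no ephemeral znode under /brokers/ids.

    registered_children: child names of /brokers/ids (assumed to be id strings).
    """
    registered = {int(name) for name in registered_children}
    return sorted(set(expected_ids) - registered)


def controller_id(controller_znode_data):
    """Extract the broker id from the /controller znode payload (assumed JSON)."""
    return json.loads(controller_znode_data)["brokerid"]


def check_cluster(expected_ids, registered_children, controller_znode_data):
    """Summarize registration state; pass None if /controller does not exist."""
    ctrl = controller_id(controller_znode_data) if controller_znode_data else None
    return {
        "missing_brokers": missing_brokers(expected_ids, registered_children),
        "controller": ctrl,
        # A controller whose own broker znode is gone is also suspicious.
        "controller_registered": ctrl is not None
        and str(ctrl) in set(registered_children),
    }
```

Calling check_cluster([1, 2, 3], ["1", "3"], data) after a ZK outage would report broker 2 as missing, which corresponds to symptom 1 above.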

I'll try to describe each one of the problems in more detail below, and hope to be able to clearly separate them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)