Posted to dev@kafka.apache.org by "Ismael Juma (JIRA)" <ji...@apache.org> on 2016/11/30 16:20:58 UTC

[jira] [Updated] (KAFKA-4418) Broker Leadership Election Fails If Missing ZK Path Raises Exception

     [ https://issues.apache.org/jira/browse/KAFKA-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma updated KAFKA-4418:
-------------------------------
    Labels: reliability  (was: )

> Broker Leadership Election Fails If Missing ZK Path Raises Exception
> --------------------------------------------------------------------
>
>                 Key: KAFKA-4418
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4418
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Michael Pedersen
>              Labels: reliability
>
> Our Kafka cluster went down because a single node failed *and* a ZooKeeper path was missing for one topic (/brokers/topics/<topicname>/partitions). When this happened, leadership election could not run and failed with a stack trace like this:
> Failed to start preferred replica election
> org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
> 	at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
> 	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
> 	at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
> 	at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:537)
> 	at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:817)
> 	at kafka.utils.ZkUtils$$anonfun$getAllPartitions$1.apply(ZkUtils.scala:816)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> 	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at kafka.utils.ZkUtils.getAllPartitions(ZkUtils.scala:816)
> 	at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:64)
> 	at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/warandpeace/partitions
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
> 	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
> 	at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
> 	at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
> 	at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
> 	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
> 	... 16 more
> I have looked through the code and found a place where a quick fix would seem to allow leadership election to continue. Specifically, the function at https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/utils/ZkUtils.scala#L633 does not handle possible exceptions. Wrapping the call in a try/catch block would work, but could introduce other problems:
> * If the function is used elsewhere, a caller at a higher level might rely on the exception to stop some other operation.
> * Unless the exception is logged/reported somehow, no one will know this problem exists, which makes debugging other problems harder.
> I'm sure there are other issues I'm not aware of, but those two come to mind quickly. What would be the best route for getting this resolved quickly?
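
The try/catch idea from the report, together with its concern about logging, might be sketched roughly as follows. This is a minimal, self-contained Java illustration and not Kafka's code: FakeZk, NoNodeException, and this getAllPartitions are hypothetical stand-ins (an in-memory map replaces ZooKeeper) so the example runs without a cluster. The catch is scoped to one topic and records a warning, so a single missing path neither aborts the scan nor goes unnoticed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MissingPathSketch {
    /** Hypothetical stand-in for a "no such node" error; not the zkclient class. */
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    /** Hypothetical in-memory stand-in for a ZooKeeper client: path -> child names. */
    static class FakeZk {
        private final Map<String, List<String>> children = new LinkedHashMap<>();

        void put(String path, List<String> kids) { children.put(path, kids); }

        List<String> getChildren(String path) {
            List<String> kids = children.get(path);
            if (kids == null) throw new NoNodeException(path); // path is missing
            return kids;
        }
    }

    /**
     * Collects "topic-partition" names for every topic. A topic whose
     * partitions path is missing is skipped and reported in warnings
     * instead of failing the whole scan.
     */
    static List<String> getAllPartitions(FakeZk zk, List<String> warnings) {
        List<String> result = new ArrayList<>();
        for (String topic : zk.getChildren("/brokers/topics")) {
            String path = "/brokers/topics/" + topic + "/partitions";
            try {
                for (String partition : zk.getChildren(path)) {
                    result.add(topic + "-" + partition);
                }
            } catch (NoNodeException e) {
                // Record the problem so it stays visible to operators.
                warnings.add("Skipping topic " + topic + ": " + e.getMessage());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.put("/brokers/topics", Arrays.asList("good", "warandpeace"));
        zk.put("/brokers/topics/good/partitions", Arrays.asList("0", "1"));
        // "/brokers/topics/warandpeace/partitions" is deliberately absent.

        List<String> warnings = new ArrayList<>();
        System.out.println(getAllPartitions(zk, warnings));
        warnings.forEach(System.err::println);
    }
}
```

Whether skipping is safe for every caller is exactly the first concern raised in the report. An alternative under the same assumptions would be to keep the throwing variant and add a separate lenient one, so each caller chooses explicitly.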



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)