Posted to users@kafka.apache.org by Paul Mackles <pm...@adobe.com> on 2013/09/23 02:42:14 UTC

broker giving up partition leadership for no apparent reason

With 0.8, we have a situation where a broker is giving up (or being removed from) leadership of its partitions for no apparent reason. The cluster has 3 nodes. In this case, broker id=1 stopped leading. This is what I see in server.log at the time it stopped leading:

[2013-09-22 14:00:06,141] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaZooKeeper)
[2013-09-22 14:00:06,507] INFO Registered broker 1 at path /brokers/ids/1 with address 10.27.63.37:9092. (kafka.utils.ZkUtils$)
[2013-09-22 14:00:06,508] INFO done re-registering broker (kafka.server.KafkaZooKeeper)
[2013-09-22 14:00:06,509] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaZooKeeper)
[2013-09-22 14:00:06,515] INFO Closing socket connection to /10.27.63.37. (kafka.network.Processor)
[2013-09-22 14:00:06,519] INFO conflict in /controller data: 1 stored data: 2 (kafka.utils.ZkUtils$)
[2013-09-22 14:00:06,526] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

The broker process itself stayed up, and I was able to get it back to leading by simply running the preferred-replica-election tool. Looking at server.log, controller.log and state-change.log on all 3 brokers, it's unclear what triggered this. I thought it might be a problem communicating with ZK, but I don't see any such errors. The broker had been running fine for several days prior to this. I also looked at the GC logs and don't see any long-running garbage collection at that time.
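
For reference, the command I ran to restore leadership was roughly the following (the ZooKeeper connect string here is a placeholder for ours, and I let it default to all topics/partitions):

bin/kafka-preferred-replica-election.sh --zookeeper zkhost1:2181,zkhost2:2181,zkhost3:2181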

What else should I be looking for?

Thanks,
Paul


Re: broker giving up partition leadership for no apparent reason

Posted by Neha Narkhede <ne...@gmail.com>.
The logs show that the broker had to re-register its broker information in
ZooKeeper. That would mean its previous registration (the ephemeral znode under
/brokers/ids) was lost. It could be GC on the broker or some issue on the
ZooKeeper side. It would help if you send around the log4j log from just before
the re-registration.
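
If it does turn out to be session expiration, one thing worth checking is the
session timeout in the broker's server.properties (the values below are just
illustrative defaults):

# zookeeper session timeout - a pause longer than this expires the session
zookeeper.session.timeout.ms=6000
# max time the broker waits to establish its zookeeper connection
zookeeper.connection.timeout.ms=6000

A GC pause or network hiccup longer than the session timeout would drop the
ephemeral registration, which would match the re-registration you see in the log.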

Another thing that will help is to send around the output of the state
change log merger tool
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
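
For example, with state-change.log collected from all three brokers (the file
names below are placeholders), something along the lines of:

bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger --logs broker1/state-change.log,broker2/state-change.log,broker3/state-change.log --topic your-topic

will merge the entries and sort them by time, so you can follow the leadership
transitions across all brokers in order.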

Thanks,
Neha
On Sep 22, 2013 5:42 PM, "Paul Mackles" <pm...@adobe.com> wrote:

> With 0.8, we have a situation where a broker is removing itself (or being
> removed) as a leader for no apparent reason. The cluster has 3 nodes. In
> this case, broker id=1 stopped leading. This is what I see in the
> server.log at the time it stopped leading:
>
> [2013-09-22 14:00:06,141] INFO re-registering broker info in ZK for broker
> 1 (kafka.server.KafkaZooKeeper)
> [2013-09-22 14:00:06,507] INFO Registered broker 1 at path /brokers/ids/1
> with address 10.27.63.37:9092. (kafka.utils.ZkUtils$)
> [2013-09-22 14:00:06,508] INFO done re-registering broker
> (kafka.server.KafkaZooKeeper)
> [2013-09-22 14:00:06,509] INFO Subscribing to /brokers/topics path to
> watch for new topics (kafka.server.KafkaZooKeeper)
> [2013-09-22 14:00:06,515] INFO Closing socket connection to /10.27.63.37.
> (kafka.network.Processor)
> [2013-09-22 14:00:06,519] INFO conflict in /controller data: 1 stored
> data: 2 (kafka.utils.ZkUtils$)
> [2013-09-22 14:00:06,526] INFO New leader is 2
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
>
> The broker process itself stayed up and I was able to get it back to
> leading by simply running the preferred-replica-election tool. Looking at
> server.log, controller.log and state-change.log on all 3 brokers, it's
> unclear what triggered this. I thought it might be a problem communicating
> with ZK but I don't see any such errors. The broker had been running fine
> for several days prior to this. I looked at the gc logs and I don't see any
> long running garbage collection at that time.
>
> What else should I be looking for?
>
> Thanks,
> Paul
>
>