Posted to users@kafka.apache.org by James Cheng <jc...@tivo.com> on 2016/03/01 23:45:30 UTC

Unavailable partitions (Leader: -1 and ISR is empty) and we can't figure out how to get them back online

Hi,

We have 44 partitions in our cluster that are unavailable. kafka-topics.sh reports them with Leader: -1 and with no brokers in the ISR. For the example partition shown below, ZooKeeper lists broker 5 as the only assigned replica, so broker 5 should be the partition leader. These are topics with replication-factor 1. Most of them hold little to no data and see very little traffic. We currently cannot produce to them, and I can't find anything in the logs that explains why the broker is not bringing the partitions back online. Can anyone help?
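
To make the symptom concrete, here is a rough Python sketch of how the affected partitions could be listed from the text that kafka-topics.sh --describe prints. The tab-separated row layout in SAMPLE is an assumption about the tool's output and may differ between versions, and healthy.topic is a made-up name used only for contrast.

```python
import re

# One partition row per line, in the tab-separated shape that
# `kafka-topics.sh --describe` prints (assumed layout, may vary by version).
ROW = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>\S*)\s+Isr:\s*(?P<isr>\S*)"
)


def leaderless_partitions(describe_output):
    """Return (topic, partition, replicas) for every row that shows Leader: -1."""
    found = []
    for line in describe_output.splitlines():
        m = ROW.search(line)
        if m and int(m.group("leader")) == -1:
            replicas = [int(r) for r in m.group("replicas").split(",") if r]
            found.append((m.group("topic"), int(m.group("partition")), replicas))
    return found


# Hypothetical sample input; "healthy.topic" is invented for contrast.
SAMPLE = "\n".join([
    "Topic:the.topic.name\tPartitionCount:1\tReplicationFactor:1\tConfigs:",
    "\tTopic: the.topic.name\tPartition: 0\tLeader: -1\tReplicas: 5\tIsr: ",
    "\tTopic: healthy.topic\tPartition: 0\tLeader: 2\tReplicas: 2\tIsr: 2",
])

for topic, partition, replicas in leaderless_partitions(SAMPLE):
    print(f"{topic}-{partition}: no leader, assigned replicas {replicas}")
```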

Relevant log lines are attached below.

Questions:
* What does Leader: -1 mean?
* Why doesn't the broker take the partition back online?
* Is there more debugging/logging that I can turn on?
* Note: unclean.leader.election.enable=false right now. During a previous boot of the broker we set it to true to get some partitions back online, but these ones never came back online.

Thanks,
-James

In ZooKeeper
-------
$ get /brokers/topics/the.topic.name
{"version":1,"partitions":{"0":[5]}}
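
The znode above only holds the replica assignment. The leader and ISR that the controller reports (the {"leader":-1,"leader_epoch":56,"isr":[]} value quoted in the controller log further down) are kept separately, per partition. As a rough sketch, assuming both JSON strings have already been fetched, the two can be compared like this in Python. The function name and the dict layout are made up for illustration, and the live broker set is copied from the controller's error message.

```python
import json


def offline_report(assignment_json, state_by_partition, live_brokers):
    """Compare replica assignment with leader/ISR state for each partition."""
    assignment = json.loads(assignment_json)["partitions"]
    report = {}
    for partition, replicas in assignment.items():
        state = json.loads(state_by_partition[partition])
        alive = [b for b in replicas if b in live_brokers]
        report[int(partition)] = {
            "replicas": replicas,
            "leader": state["leader"],
            "isr": state["isr"],
            # Replicas that are up but are not listed in the ISR.
            "alive_not_in_isr": [b for b in alive if b not in state["isr"]],
        }
    return report


# Values copied from this message: the topic znode and the controller log.
assignment_json = '{"version":1,"partitions":{"0":[5]}}'
state_by_partition = {"0": '{"leader":-1,"leader_epoch":56,"isr":[]}'}
live_brokers = {1, 2, 3, 4, 5}

print(offline_report(assignment_json, state_by_partition, live_brokers))
```

For the partition in question this shows broker 5 as assigned and alive but absent from the ISR, which matches the NoReplicaOnlineException in the controller log.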

server.log
--------
[2016-03-01 06:29:13,869] WARN Found an corrupted index file, /TivoData/kafka/the.topic.name-0/00000000000000000000.index, deleting and rebuilding index... (kafka.log.Log)
[2016-03-01 06:29:13,870] INFO Recovering unflushed segment 0 in log the.topic.name-0. (kafka.log.Log)
[2016-03-01 06:29:13,870] INFO Completed load of log the.topic.name-0 with log end offset 0 (kafka.log.Log)


state-change.log on broker 5
---------
state-change.log.2016-03-01-06:[2016-03-01 06:34:20,498] TRACE Broker 5 cached leader info (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20),ReplicationFactor:1),AllReplicas:5) for partition [the.topic.name,0] in response to UpdateMetadata request sent by controller 2 epoch 20 with correlation id 0 (state.change.logger)
state-change.log.2016-03-01-06:[2016-03-01 06:34:20,695] TRACE Broker 5 received LeaderAndIsr request (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20),ReplicationFactor:1),AllReplicas:5) correlation id 1 from controller 2 epoch 20 for partition [the.topic.name,0] (state.change.logger)
state-change.log.2016-03-01-06:[2016-03-01 06:34:20,957] TRACE Broker 5 handling LeaderAndIsr request correlationId 1 from controller 2 epoch 20 starting the become-follower transition for partition [the.topic.name,0] (state.change.logger)
state-change.log.2016-03-01-06:[2016-03-01 06:34:23,531] ERROR Broker 5 received LeaderAndIsrRequest with correlation id 1 from controller 2 epoch 20 for partition [the.topic.name,0] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
state-change.log.2016-03-01-06:[2016-03-01 06:34:30,075] TRACE Broker 5 completed LeaderAndIsr request correlationId 1 from controller 2 epoch 20 for the become-follower transition for partition [the.topic.name,0] (state.change.logger)
state-change.log.2016-03-01-06:[2016-03-01 06:34:30,458] TRACE Broker 5 cached leader info (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20),ReplicationFactor:1),AllReplicas:5) for partition [the.topic.name,0] in response to UpdateMetadata request sent by controller 2 epoch 20 with correlation id 2 (state.change.logger)

state-change.log on the controller (broker 2)
---------
[2016-03-01 06:34:15,077] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 2 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,145] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 5 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,200] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20),ReplicationFactor:1),AllReplicas:5) for partition [the.topic.name,0] in response to UpdateMetadata request sent by controller 2 epoch 20 with correlation id 9144 (state.change.logger)
[2016-03-01 06:34:15,276] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 4 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,418] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 1 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,484] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 3 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,585] TRACE Controller 2 epoch 20 changed state of replica 5 for partition [the.topic.name,0] from OfflineReplica to OnlineReplica (state.change.logger)
[2016-03-01 06:34:15,606] TRACE Controller 2 epoch 20 sending become-follower LeaderAndIsr request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 5 for partition [the.topic.name,0] (state.change.logger)
[2016-03-01 06:34:15,645] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 2 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,775] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 5 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,787] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20),ReplicationFactor:1),AllReplicas:5) for partition [the.topic.name,0] in response to UpdateMetadata request sent by controller 2 epoch 20 with correlation id 9145 (state.change.logger)
[2016-03-01 06:34:15,838] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 4 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,879] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 1 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,909] TRACE Controller 2 epoch 20 sending UpdateMetadata request (Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 3 for partition the.topic.name-0 (state.change.logger)
[2016-03-01 06:34:15,972] TRACE Controller 2 epoch 20 started leader election for partition [the.topic.name,0] (state.change.logger)
[2016-03-01 06:34:15,973] ERROR Controller 2 epoch 20 initiated state change for partition [the.topic.name,0] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No broker in ISR for partition [the.topic.name,0] is alive. Live brokers are: [Set(5, 1, 2, 3, 4)], ISR brokers are: []
[2016-03-01 06:34:30,427] TRACE Controller 2 epoch 20 received response {error_code=0,partitions=[{topic=the.topic.name=0,error_code=0}]} for a request sent to broker Node(5, core05.tec1.tivo.com, 9092) (state.change.logger)
[2016-03-01 06:38:41,552] TRACE Controller 2 epoch 20 started leader election for partition [the.topic.name,0] (state.change.logger)
[2016-03-01 06:38:41,554] ERROR Controller 2 epoch 20 encountered error while electing leader for partition [the.topic.name,0] due to: Preferred replica 5 for partition [the.topic.name,0] is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":56,"isr":[]}]. (state.change.logger)
[2016-03-01 06:38:41,554] ERROR Controller 2 epoch 20 initiated state change for partition [the.topic.name,0] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [the.topic.name,0] due to: Preferred replica 5 for partition [the.topic.name,0] is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":56,"isr":[]}].
Caused by: kafka.common.StateChangeFailedException: Preferred replica 5 for partition [the.topic.name,0] is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":56,"isr":[]}]
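
Since all 44 partitions produce lines like these, a small Python sketch for pulling the leader/ISR tuple out of a state-change.log line may help when summarizing them. The regular expression is written against the lines pasted above only, so treat the pattern and the function name as assumptions rather than a complete parser for the log format.

```python
import re

# Matches the "(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20)" fragment
# as it appears in the state-change.log lines quoted in this message.
STATE = re.compile(
    r"Leader:(-?\d+),ISR:([\d,]*),LeaderEpoch:(\d+),ControllerEpoch:(\d+)"
)


def parse_state(line):
    """Return the leader/ISR fields from one log line, or None if absent."""
    m = STATE.search(line)
    if m is None:
        return None
    leader, isr, leader_epoch, controller_epoch = m.groups()
    return {
        "leader": int(leader),
        "isr": [int(b) for b in isr.split(",") if b],  # empty ISR -> []
        "leader_epoch": int(leader_epoch),
        "controller_epoch": int(controller_epoch),
    }


# One of the controller lines from above, copied verbatim.
LINE = (
    "[2016-03-01 06:34:15,606] TRACE Controller 2 epoch 20 sending "
    "become-follower LeaderAndIsr request "
    "(Leader:-1,ISR:,LeaderEpoch:56,ControllerEpoch:20) to broker 5 "
    "for partition [the.topic.name,0] (state.change.logger)"
)

print(parse_state(LINE))
```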

