Posted to users@kafka.apache.org by Krishna Kumar <kk...@nanigans.com> on 2015/07/09 21:04:32 UTC

ISR not a replica

Hi

We added a Kafka node and it suddenly became the leader and the sole in-sync
replica for some partitions, even though it is not in the assigned replica list.

Any idea how we might be able to fix this? We are on Kafka 0.8.2

Topic: topic1 Partition: 0	Leader: 2	Replicas: 2,1,0	Isr: 2,0,1
	Topic: topic1 Partition: 1	Leader: 3	Replicas: 0,2,1	Isr: 3
	Topic: topic1 Partition: 2	Leader: 3	Replicas: 1,0,2	Isr: 3
	Topic: topic1 Partition: 3	Leader: 2	Replicas: 2,0,1	Isr: 2,0,1
	Topic: topic1 Partition: 4	Leader: 3	Replicas: 0,1,2	Isr: 3
	Topic: topic1 Partition: 5	Leader: 1	Replicas: 1,2,0	Isr: 1,2,0
	Topic: topic1 Partition: 6	Leader: 3	Replicas: 2,1,0	Isr: 3
	Topic: topic1 Partition: 7	Leader: 0	Replicas: 0,2,1	Isr: 0,1,2
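
For reference, this layout is the output of the topic describe tool (assuming
ZooKeeper is reachable on localhost:2181):

/usr/local/kafka/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topic1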






Re: ISR not a replica

Posted by Guozhang Wang <wa...@gmail.com>.
OK, it seems you had a controller migration some time ago and the old
controller (broker 0) did not de-register its listeners, even though its
controller modules, such as the partition state machine, had already been
shut down. You can try to verify this through the active-controller metrics.

If that is the case, you can try bouncing the old controller broker and
re-running the admin tool to see if it works now.

There are a couple of known bugs in older versions of Kafka that can cause
resigned controllers to not de-register their ZK listeners; which version
are you using? I suggest upgrading to the latest version to see if those
issues go away.
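
For example, the controller state can also be checked directly from zkCli.sh (a
rough sketch; paths assume no ZooKeeper chroot). /controller holds the id of the
broker that currently claims the controller role, and /controller_epoch holds
the latest epoch; the ActiveControllerCount JMX metric should be 1 on exactly
one broker.

get /controller
get /controller_epoch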

Guozhang


Re: ISR not a replica

Posted by Krishna Kumar <kk...@nanigans.com>.
Yes, there were messages in the controller logs such as

DEBUG [OfflinePartitionLeaderSelector]: No broker in ISR is alive for
[topic1,2]. Pick the leader from the alive assigned replicas:
(kafka.controller.OfflinePartitionLeaderSelector)

ERROR [Partition state machine on Controller 0]: Error while moving some
partitions to NewPartition state (kafka.controller.PartitionStateMachine)
kafka.common.StateChangeFailedException: Controller 0 epoch 0 initiated
state change for partition [topic1,17] to NewPartition failed because the
partition state machine has not started

ERROR [AddPartitionsListener on 0]: Error while handling add partitions
for data path /brokers/topics/topic1
(kafka.controller.PartitionStateMachine$AddPartitionsListener)
java.util.NoSuchElementException: key not found: [topic1,17]

INFO [Controller 0]: List of topics ineligible for deletion: topic1



Quite a lot of these, actually.


Re: ISR not a replica

Posted by Guozhang Wang <wa...@gmail.com>.
Krish,

If you only add a new broker (for example broker 3) into your cluster
without doing anything else, this broker will not automatically get any
topic-partitions migrated to itself, so I suspect at least some admin tools
were executed.

The log exceptions you showed in the previous emails come from the server
logs. Could you also check the controller logs (on broker 1 in your
scenario) and see if there are any exceptions or errors?
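
For instance, something along these lines on broker 1 (a sketch; the path
assumes the default log layout under the Kafka install directory):

grep -iE "exception|error" /usr/local/kafka/logs/controller.log | tail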

Guozhang


Re: ISR not a replica

Posted by Krishna Kumar <kk...@nanigans.com>.
So we think we have a process to fix this issue via ZooKeeper. If anyone has any thoughts, please let me know.

First get the “state” from a good partition, to get the correct epochs:

In /usr/local/zookeeper/zkCli.sh

[zk: localhost:2181(CONNECTED) 4] get /brokers/topics/topic1/partitions/6/state

  {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}

Then, as long as we are sure those brokers have replicas, we set this onto the ‘stuck’ partition (6 is unstuck, 4 is stuck):

set /brokers/topics/topic1/partitions/4/state  {"controller_epoch":22,"leader":1,"version":1,"leader_epoch":55,"isr":[2,0,1]}

And run the preferred replica election for that partition only:

su java -c "/usr/local/kafka/bin/kafka-preferred-replica-election.sh --zookeeper localhost:2181 --path-to-json-file /tmp/topic1.json"

Json file:

{
"version":1,
"partitions":[{"topic”:"topic1","partition":4}]
}
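
After the election completes, re-checking the partition state in zkCli.sh should
show the new leader and a refreshed leader epoch and ISR:

get /brokers/topics/topic1/partitions/4/state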





Re: ISR not a replica

Posted by Krishna Kumar <kk...@nanigans.com>.
Well, 3 (the new node) was shut down, so there were no messages there. "1"
was the leader and we saw the messages on "0" and "2".

We managed to resolve this new problem to an extent by shutting down "1".
We were worried because "1" was the only replica in the ISR. But once it
went down, "0" and "2" entered the ISR. Then, on bringing back "1", it too
added itself to the ISR.

We still see a few partitions in some topics that do not have all the
replicas in the ISR. Hopefully, that resolves itself over the next few
hours.

But in the end we are at the same spot we were in earlier. There are
partitions with Leader "3" although "3" is not one of the replicas, and none
of the replicas are in the ISR. We want to remove "3" as a leader and get
the others working. Not sure what our options are.
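
One option might be an explicit reassignment of the affected partitions back
onto brokers 0, 1 and 2, which should eventually move leadership off 3 (a rough
sketch only; the partition list and file path here are illustrative):

cat > /tmp/reassign.json <<'EOF'
{"version":1,
 "partitions":[{"topic":"topic1","partition":1,"replicas":[0,2,1]}]}
EOF

/usr/local/kafka/bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file /tmp/reassign.json --execute

# later, check whether the reassignment has completed
/usr/local/kafka/bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file /tmp/reassign.json --verify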





Re: ISR not a replica

Posted by Guozhang Wang <wa...@gmail.com>.
Krish,

Do brokers 0 and 3 have similar warning log entries to broker 2 for stale
controller epochs?

Guozhang


Re: ISR not a replica

Posted by Krishna Kumar <kk...@nanigans.com>.
So we tried taking that node down. But that didn't fix the issue, so we
restarted the other nodes.

This seems to have led to two of the other replicas dropping out of the ISR
for *all* topics.

Topic: topic2 Partition: 0	Leader: 1	Replicas: 1,0,2	Isr: 1
	Topic: topic2 Partition: 1	Leader: 1	Replicas: 2,1,0	Isr: 1
	Topic: topic2 Partition: 2	Leader: 1	Replicas: 0,2,1	Isr: 1
	Topic: topic2 Partition: 3	Leader: 1	Replicas: 1,2,0	Isr: 1


I am seeing this message => Broker 2 ignoring LeaderAndIsr request from
controller 1 with correlation id 8685 since its controller epoch 21 is
old. Latest known controller epoch is 89 (state.change.logger)
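
For reference, these warnings come from the state-change logger, so they can be
pulled out on each broker with something like this (the path assumes the
default log layout):

grep "ignoring LeaderAndIsr" /usr/local/kafka/logs/state-change.log | tail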





Re: ISR not a replica

Posted by Krishna Kumar <kk...@nanigans.com>.
Thanks Guozhang

We did run the partition reassignment, but against another topic, and that
went well.

But this happened to this topic without us doing anything.

Regards
Krish



Re: ISR not a replica

Posted by Guozhang Wang <wa...@gmail.com>.
Krishna,

Did you run any admin tools after adding the node (I assume it is node 3),
such as partition reassignment? It is shown as the only one in the ISR list
but not in the replica list, which suggests that the partition migration
process was not completed.

You can verify whether this is the case by checking your controller log to
see if there are any exception or error entries.
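
A quick way to see whether a reassignment is still in flight is to look for the
admin znode in zkCli.sh; it only exists while a reassignment is pending
(assuming no ZooKeeper chroot):

get /admin/reassign_partitions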

Guozhang
