Posted to users@kafka.apache.org by Wes Chow <we...@chartbeat.com> on 2015/04/21 18:16:26 UTC

partition reassignment stuck

I started a partition reassignment (this is a 0.8.1.1 cluster) some time
ago and it seems to be stuck. Partitions are no longer getting moved 
around, but it seems like the cluster is operational otherwise. The 
stuck nodes have a lot of 00000000000000000000.index files, and their 
logs show errors like:

[2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR 
kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-28], Error 
for partition [pings,227] to broker 28:class kafka.common.UnknownException

I'm at a loss as to what I should be looking at. Any ideas?

Thanks,
Wes
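
A stuck reassignment can be narrowed down by listing which target replicas have not yet joined the ISR. The sketch below is a minimal illustration, assuming the pending reassignment is still recorded in ZooKeeper under /admin/reassign_partitions in the usual 0.8-era JSON layout. The function name pending_moves, the topic name, and all broker ids are hypothetical. The payloads are passed in as strings, so fetching them from ZooKeeper is left out.

```python
import json

def pending_moves(reassign_znode, isr_by_partition):
    """Map (topic, partition) to the target replicas still missing from the ISR."""
    pending = {}
    for entry in json.loads(reassign_znode)["partitions"]:
        key = (entry["topic"], entry["partition"])
        in_sync = set(isr_by_partition.get(key, []))
        missing = sorted(set(entry["replicas"]) - in_sync)
        if missing:
            pending[key] = missing
    return pending

# Hypothetical payload in the assumed /admin/reassign_partitions layout.
reassign_znode = json.dumps({
    "version": 1,
    "partitions": [
        {"topic": "example_topic", "partition": 0, "replicas": [1, 2, 3]},
        {"topic": "example_topic", "partition": 1, "replicas": [2, 3, 4]},
    ],
})
# Hypothetical ISR per partition, as read from each partition's state znode.
isr_by_partition = {
    ("example_topic", 0): [1, 2],
    ("example_topic", 1): [2, 3, 4],
}

print(pending_moves(reassign_znode, isr_by_partition))
```

Partitions that stay in this list with the same missing replicas over time are the ones to look at first.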

Re: partition reassignment stuck

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.

Hard to say, but if your producers keep producing data and are working
well, then you probably don't need to.

On 4/21/15, 5:34 PM, "Wesley Chow" <we...@chartbeat.com> wrote:

>There is only one broker that thinks it's the controller right now.  The
>double controller situation happened earlier this morning. Do the other
>brokers have to be bounced after the controller situation is fixed? I did
>not do that for all brokers.
>
>Wes

Re: partition reassignment stuck

Posted by Wesley Chow <we...@chartbeat.com>.

There is only one broker that thinks it's the controller right now.  The
double controller situation happened earlier this morning. Do the other
brokers have to be bounced after the controller situation is fixed? I did
not do that for all brokers.

Wes
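
Whether more than one broker still believes it is the controller can be cross-checked against ZooKeeper. This is a minimal sketch, assuming the /controller znode holds JSON with a brokerid field and that each broker exposes an ActiveControllerCount gauge over JMX. Both are assumptions about the cluster's version and setup, and collecting the values is not shown. The function name stale_controllers and the broker ids are made up for illustration.

```python
import json

def stale_controllers(controller_znode, active_counts):
    """Return ids of brokers that claim to be controller but are not the
    broker registered in the /controller znode payload."""
    registered = json.loads(controller_znode)["brokerid"]
    claimants = [broker for broker, count in active_counts.items() if count > 0]
    return sorted(broker for broker in claimants if broker != registered)

# Hypothetical /controller payload and per-broker gauge readings.
controller_znode = '{"version":1,"brokerid":2,"timestamp":"1429622400000"}'
active_counts = {1: 0, 2: 1, 3: 1}

print(stale_controllers(controller_znode, active_counts))
```

An empty result means that only the registered broker reports itself as active controller. Any id returned is a candidate for a restart.
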
On Apr 21, 2015 8:25 PM, "Jiangjie Qin" <jq...@linkedin.com.invalid> wrote:

>  Yes, should be broker 25 thread 0 from the log.
> This needs to be resolved, you might need to bounce both of the brokers
> who think itself as controller respectively. The new controller should be
> able to continue the partition reassignment.

Re: partition reassignment stuck

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.

Yes, it should be broker 25, thread 0, going by the log.
This needs to be resolved. You might need to bounce both of the brokers that think they are the controller. The new controller should be able to continue the partition reassignment.
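
The mismatch reported for click_engage partition 116 can be checked mechanically. The sketch below is an illustration only. It assumes the topic znode stores a partitions map and the state znode stores leader and isr fields, as in 0.8-era clusters. The replica, leader, and ISR values are the ones quoted in this thread, the epoch values are placeholders, and check_partition is a hypothetical helper, not part of Kafka.

```python
import json

def check_partition(topic_znode, state_znode, partition):
    """List ways the partition state disagrees with the assigned replicas."""
    assigned = set(json.loads(topic_znode)["partitions"][str(partition)])
    state = json.loads(state_znode)
    problems = []
    if state["leader"] not in assigned:
        problems.append("leader %d is not an assigned replica" % state["leader"])
    for broker in state["isr"]:
        if broker not in assigned:
            problems.append("ISR member %d is not an assigned replica" % broker)
    return problems

# /brokers/topics/click_engage, reduced to the one partition discussed here.
topic_znode = '{"version":1,"partitions":{"116":[4,7,25]}}'
# /brokers/topics/click_engage/partitions/116/state; epochs are placeholders.
state_znode = ('{"controller_epoch":1,"leader":28,"version":1,'
               '"leader_epoch":1,"isr":[28,15]}')

for problem in check_partition(topic_znode, state_znode, 116):
    print(problem)
```

Here every broker in the leader and ISR fields falls outside the assigned replica set, which matches the inconsistency described in the quoted message.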

From: Wes Chow <we...@chartbeat.com>
Reply-To: "users@kafka.apache.org" <us...@kafka.apache.org>
Date: Tuesday, April 21, 2015 at 1:29 PM
To: "users@kafka.apache.org" <us...@kafka.apache.org>
Subject: Re: partition reassignment stuck

Quick clarification: you say broker 0, but do you actually mean broker 25? Broker 25, one of the replicas for the partition, is currently the one having trouble getting into sync, and 28 is the leader for the partition.

Unfortunately, the logs have rotated off so I can't get to what happened around then. However, there was a period of a few hours where we had two brokers that both believed they were controllers. I'm not sure why I didn't think of this before.

ZooKeeper data appears to be inconsistent at the moment. /brokers/topics/click_engage says that partition 116's replica set is: [4, 7, 25]. /brokers/topics/click_engage/partitions/116/state says the leader is 28 and the ISR is [28, 15]. Does this need to be resolved, and if so how?

Thanks,
Wes

Re: partition reassignment stuck

Posted by Wes Chow <we...@chartbeat.com>.

Quick clarification: you say broker 0, but do you actually mean broker
25? Broker 25, one of the replicas for the partition, is currently the one
having trouble getting into sync, and 28 is the leader for the partition.

Unfortunately, the logs have rotated off so I can't get to what happened
around then. However, there was a period of a few hours where we had
two brokers that both believed they were controllers. I'm not sure why I
didn't think of this before.

ZooKeeper data appears to be inconsistent at the moment. 
/brokers/topics/click_engage says that partition 116's replica set is: 
[4, 7, 25]. /brokers/topics/click_engage/partitions/116/state says the 
leader is 28 and the ISR is [28, 15]. Does this need to be resolved, and 
if so how?

Thanks,
Wes
> Jiangjie Qin <ma...@linkedin.com.INVALID>
> April 21, 2015 at 2:24 PM
> This means that the broker 0 thought broker 28 was leader for that
> partition but broker 28 has actually already received
> StopReplicaRequest from controller and stopped serving as a replica
> for that partition.
> This might happen transiently but broker 0 will be able to find the
> new leader for the partition once it receive LeaderAndIsrRequest from
> controller to update the new leader information. If these messages
> keep got logged for long time then there might be an issue.
> Can you maybe check the timestamp around [2015-04-21 12:15:36,585] on
> broker 28 to see if there is some error log. The error log might not
> have partition info included.

Re: partition reassignment stuck

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.

This means that broker 0 thought broker 28 was the leader for that partition, but broker 28 had actually already received a StopReplicaRequest from the controller and stopped serving as a replica for that partition.
This might happen transiently, but broker 0 will be able to find the new leader for the partition once it receives a LeaderAndIsrRequest from the controller with the new leader information. If these messages keep getting logged for a long time, then there might be an issue.
Can you check the logs on broker 28 around the timestamp [2015-04-21 12:15:36,585] to see if there is an error logged? The error log might not include partition info.
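
The thread name in the original log line is easy to misread. In 0.8-era brokers the replica fetcher thread appears to be named ReplicaFetcherThread-<fetcher id>-<source broker id>, so the 0 would be a fetcher index rather than a broker id. Treat that reading as an assumption. A minimal parser sketch follows, where parse_fetch_error is a hypothetical helper:

```python
import re

# Matches the message part of the ReplicaFetcherThread error line.
FETCH_ERROR = re.compile(
    r"\[ReplicaFetcherThread-(?P<fetcher>\d+)-(?P<source>\d+)\], "
    r"Error for partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"to broker \d+:class (?P<error>\S+)"
)

def parse_fetch_error(line):
    """Extract fetcher id, source broker, topic, partition, and error class."""
    match = FETCH_ERROR.search(line)
    if match is None:
        return None
    return {
        "fetcher_id": int(match.group("fetcher")),
        "source_broker": int(match.group("source")),
        "topic": match.group("topic"),
        "partition": int(match.group("partition")),
        "error": match.group("error"),
    }

# The log line from the original post, joined onto one line.
line = (
    "[2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR "
    "kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-28], "
    "Error for partition [pings,227] to broker 28:class "
    "kafka.common.UnknownException"
)

print(parse_fetch_error(line))
```

The id of the broker that wrote the line is not in the line itself. It has to come from knowing which host the log was read on.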

From: Wes Chow <we...@chartbeat.com>
Reply-To: "users@kafka.apache.org" <us...@kafka.apache.org>
Date: Tuesday, April 21, 2015 at 10:50 AM
To: "users@kafka.apache.org" <us...@kafka.apache.org>
Subject: Re: partition reassignment stuck

Not for that particular partition, but I am seeing these errors on 28:

kafka.common.NotAssignedReplicaException: Leader 28 failed to record follower 25's position 0 for partition [click_engage,116] since the replica 25 is not recognized to be one of the assigned replicas for partition [click_engage,116]
        at kafka.cluster.Partition.updateLeaderHWAndMaybeExpandIsr(Partition.scala:231)
        at kafka.server.ReplicaManager.recordFollowerPosition(ReplicaManager.scala:432)
        at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:460)
        at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:458)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:176)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
        at kafka.server.KafkaApis.maybeUpdatePartitionHw(KafkaApis.scala:458)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:424)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:745)

What does this mean?

Thanks!
Wes


Re: partition reassignment stuck

Posted by Wes Chow <we...@chartbeat.com>.

Not for that particular partition, but I am seeing these errors on 28:

kafka.common.NotAssignedReplicaException: Leader 28 failed to record follower 25's position 0 for partition [click_engage,116] since the replica 25 is not recognized to be one of the assigned replicas for partition [click_engage,116]
        at kafka.cluster.Partition.updateLeaderHWAndMaybeExpandIsr(Partition.scala:231)
        at kafka.server.ReplicaManager.recordFollowerPosition(ReplicaManager.scala:432)
        at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:460)
        at kafka.server.KafkaApis$$anonfun$maybeUpdatePartitionHw$2.apply(KafkaApis.scala:458)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:176)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:345)
        at kafka.server.KafkaApis.maybeUpdatePartitionHw(KafkaApis.scala:458)
        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:424)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:186)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:745)

What does this mean?

Thanks!
Wes


> Jiangjie Qin <ma...@linkedin.com.INVALID>
> April 21, 2015 at 1:19 PM
> Those 00000000000000000000.index files are for different partitions and
> they should be generated if new replicas is assigned to the broker.
> We might want to know what caused the UnknownException. Did you see any
> error log on broker 28?
>
> Jiangjie (Becket) Qin

Re: partition reassignment stuck

Posted by Jiangjie Qin <jq...@linkedin.com.INVALID>.

Those 00000000000000000000.index files are for different partitions, and
they should be generated when new replicas are assigned to the broker.
We might want to know what caused the UnknownException. Did you see any
errors in the log on broker 28?

Jiangjie (Becket) Qin

On 4/21/15, 9:16 AM, "Wes Chow" <we...@chartbeat.com> wrote:

>I started a partition reassignment (this is a 0.8.1.1 cluster) some time
>ago and it seems to be stuck. Partitions are no longer getting moved
>around, but it seems like the cluster is operational otherwise. The
>stuck nodes have a lot of 00000000000000000000.index files, and their
>logs show errors like:
>
>[2015-04-21 12:15:36,585] 3237789 [ReplicaFetcherThread-0-28] ERROR
>kafka.server.ReplicaFetcherThread  - [ReplicaFetcherThread-0-28], Error
>for partition [pings,227] to broker 28:class kafka.common.UnknownException
>
>I'm at a loss as to what I should be looking at. Any ideas?
>
>Thanks,
>Wes
>