Posted to users@kafka.apache.org by Szymon Sobczak <sz...@getbase.com> on 2015/10/20 06:52:25 UTC

It's 5.41am, we're after 20+ hours of debugging our prod cluster. See NotAssignedReplicaException and UnknownException errors. Help?

Hi!

We're running a 5-machine production Kafka cluster on version 0.8.1.1.
Yesterday we had some disk problems on one of the replicas and decided to
replace that node with a clean one. That's when we started experiencing
many different problems:

- partition replicas are still assigned to the old node and we can't remove
it from the replica list
- replicas are lagging behind, most of the topics have only one ISR
- most of the leaders are on a single node
- CPU load on the machines is constantly high

We've tried to rebalance the cluster by moving the leaders, decreasing the
number of replicas, and a few other things, but it doesn't seem to help. In the
meantime I've noticed very weird errors in kafka.log:

For partition 0 of topic product_templates with the following description:

Topic:product_templates PartitionCount:2 ReplicationFactor:3 Configs:
Topic: product_templates Partition: 0 Leader: 135 Replicas: 135,163,68 Isr:
135,68,163
Topic: product_templates Partition: 1 Leader: 155 Replicas: 163,68,164 Isr:
155,68,164
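
For reference, that description is the output of the standard describe command,
run along these lines (the ZooKeeper address is illustrative):

bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic product_templates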

On machine 135 (which is a leader of product_templates,0) in kafka.log I
see:

kafka.common.NotAssignedReplicaException: Leader 135 failed to record
follower 155's position 0 for partition [product_templates,0] since the
replica 155 is not recognized to be one of the assigned replicas 68,163,135
for partition [product_templates,0]

And the complementary error, on 155 - NOT a replica of product_templates,0:

ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
partition [product_templates,0] to broker 135:class
kafka.common.UnknownException

Both of those happen for multiple topics, on multiple machines. Every
single one happens multiple times per second...
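
In case it helps with diagnosis: the "assigned replicas" the leader checks
against live in ZooKeeper under /brokers/topics/<topic>, and can be read with
something along these lines (ZooKeeper address illustrative):

bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/product_templates
# the znode data is the assignment JSON, e.g.
# {"version":1,"partitions":{"0":[135,163,68],"1":[163,68,164]}}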

How to approach this? Any help is appreciated!

Thanks!
Szymon.

Re: It's 5.41am, we're after 20+ hours of debugging our prod cluster. See NotAssignedReplicaException and UnknownException errors. Help?

Posted by Manish Sharma <ma...@gmail.com>.
Same here.
We started running into similar situations almost weekly ever since we
increased the number of partitions on some topics from 6 to 15 and added 3
brokers to our Kafka cluster.

Last night I stopped all producers and consumers, restarted the brokers and
ZooKeeper nodes, and then restarted the producers/consumers.

This morning I see an endless loop of shrinking ISR -> cached zkVersion
mismatch -> skip updating ISR, over and over again.

[2015-11-07 11:55:47,260] INFO Partition [Wmt_Saturday_234,10] on broker
0: Shrinking ISR for partition [Wmt_Saturday_234,10] from 0,1 to 0
(kafka.cluster.Partition)

[2015-11-07 11:55:47,267] INFO Partition [Wmt_Saturday_234,10] on broker
0: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
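
If anyone wants to compare that cached zkVersion against what ZooKeeper
actually holds, the partition state znode can be inspected with something
along these lines (ZooKeeper address illustrative; the path is the one Kafka
uses for per-partition state):

bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/Wmt_Saturday_234/partitions/10/state
# prints the leader/ISR JSON plus the znode stat, whose dataVersion should be
# the zkVersion the broker has cached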




On Tue, Oct 20, 2015 at 9:45 AM, Shaun Senecal <sh...@lithium.com>
wrote:

> I can't say this is the same issue, but it sounds similar to a situation
> we experienced with Kafka 0.8.2.[1-2].  After restarting a broker, the
> cluster would never really recover (ISRs constantly changing, replication
> failing, etc.).  We found the only way to fully recover the cluster was to
> stop all producers and consumers, restart the Kafka cluster, then once the
> cluster was back up, restart the producers/consumers.  Obviously that's not
> acceptable for a production cluster, but that was the only thing we could
> find that would get us going again.
>
>
> Shaun
>
> ________________________________________
> From: Szymon Sobczak <sz...@getbase.com>
> Sent: October 19, 2015 9:52 PM
> To: users@kafka.apache.org
> Cc: Big Data
> Subject: It's 5.41am, we're after 20+ hours of debugging our prod cluster.
> See NotAssignedReplicaException and UnknownException errors. Help?
>
> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided to
> replace that node with a clean one. That's when we started experiencing
> many different problems:
>
> - partition replicas are still assigned to the old node and we can't remove
> it from the replica list
> - replicas are lagging behind, most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing the
> number of replicas, and a few other things, but it doesn't seem to help. In
> the meantime I've noticed very weird errors in kafka.log:
>
> For partition 0 of topic product_templates with the following description:
>
> Topic:product_templates PartitionCount:2 ReplicationFactor:3 Configs:
> Topic: product_templates Partition: 0 Leader: 135 Replicas: 135,163,68 Isr:
> 135,68,163
> Topic: product_templates Partition: 1 Leader: 155 Replicas: 163,68,164 Isr:
> 155,68,164
>
> On machine 135 (which is a leader of product_templates,0) in kafka.log I
> see:
>
> kafka.common.NotAssignedReplicaException: Leader 135 failed to record
> follower 155's position 0 for partition [product_templates,0] since the
> replica 155 is not recognized to be one of the assigned replicas 68,163,135
> for partition [product_templates,0]
>
> And the complementary error, on 155 - NOT a replica of product_templates,0:
>
> ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
> kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
> partition [product_templates,0] to broker 135:class
> kafka.common.UnknownException
>
> Both of those happen for multiple topics, on multiple machines. Every
> single one happens multiple times per second...
>
> How to approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>

Re: It's 5.41am, we're after 20+ hours of debugging our prod cluster. See NotAssignedReplicaException and UnknownException errors. Help?

Posted by Shaun Senecal <sh...@lithium.com>.
I can't say this is the same issue, but it sounds similar to a situation we experienced with Kafka 0.8.2.[1-2].  After restarting a broker, the cluster would never really recover (ISRs constantly changing, replication failing, etc.).  We found the only way to fully recover the cluster was to stop all producers and consumers, restart the Kafka cluster, then once the cluster was back up, restart the producers/consumers.  Obviously that's not acceptable for a production cluster, but that was the only thing we could find that would get us going again.


Shaun

________________________________________
From: Szymon Sobczak <sz...@getbase.com>
Sent: October 19, 2015 9:52 PM
To: users@kafka.apache.org
Cc: Big Data
Subject: It's 5.41am, we're after 20+ hours of debugging our prod cluster. See NotAssignedReplicaException and UnknownException errors. Help?

Hi!

We're running a 5-machine production Kafka cluster on version 0.8.1.1.
Yesterday we had some disk problems on one of the replicas and decided to
replace that node with a clean one. That's when we started experiencing
many different problems:

- partition replicas are still assigned to the old node and we can't remove
it from the replica list
- replicas are lagging behind, most of the topics have only one ISR
- most of the leaders are on a single node
- CPU load on the machines is constantly high

We've tried to rebalance the cluster by moving the leaders, decreasing the
number of replicas, and a few other things, but it doesn't seem to help. In the
meantime I've noticed very weird errors in kafka.log:

For partition 0 of topic product_templates with the following description:

Topic:product_templates PartitionCount:2 ReplicationFactor:3 Configs:
Topic: product_templates Partition: 0 Leader: 135 Replicas: 135,163,68 Isr:
135,68,163
Topic: product_templates Partition: 1 Leader: 155 Replicas: 163,68,164 Isr:
155,68,164

On machine 135 (which is a leader of product_templates,0) in kafka.log I
see:

kafka.common.NotAssignedReplicaException: Leader 135 failed to record
follower 155's position 0 for partition [product_templates,0] since the
replica 155 is not recognized to be one of the assigned replicas 68,163,135
for partition [product_templates,0]

And the complementary error, on 155 - NOT a replica of product_templates,0:

ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
partition [product_templates,0] to broker 135:class
kafka.common.UnknownException

Both of those happen for multiple topics, on multiple machines. Every
single one happens multiple times per second...

How to approach this? Any help is appreciated!

Thanks!
Szymon.

Re: It's 5.41am, we're after 20+ hours of debugging our prod cluster. See NotAssignedReplicaException and UnknownException errors. Help?

Posted by Szymon Sobczak <sz...@getbase.com>.
What I tried so far:

- reassigning the leader to another machine:
   - found a partition where the leader was not the first replica and the
error appeared
   - ran the kafka-preferred-replica-election.sh script for that partition
(command sketch after this list)
   - checked the logs of the new leader - the same NotAssignedReplicaException
errors started appearing there
   - checked the logs of the stubborn non-replica - the same UnknownException
was appearing, but it included the new leader

- adding the stubborn follower to Replicas
   - ran the kafka-reassign-partitions.sh script to add it to Replicas
(command sketch after this list)
   - ran kafka-topics.sh --describe to make sure it was added - it was
   - checked the logs of the stubborn non-replica - the same UnknownException
was appearing
   - checked the leader's logs - now I see bigger errors -
http://pastebin.com/uSRrXa8A, related to another partition, causing the
entire request to fail
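
For completeness, the two commands above were driven roughly like this - the
ZooKeeper address and file names are illustrative, and the JSON contents match
the partition described earlier:

# preferred replica election for the affected partition
cat > election.json <<'EOF'
{"partitions": [{"topic": "product_templates", "partition": 0}]}
EOF
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 --path-to-json-file election.json

# reassignment adding broker 155 to the replica list
cat > reassign.json <<'EOF'
{"version": 1, "partitions": [{"topic": "product_templates", "partition": 0, "replicas": [135, 163, 68, 155]}]}
EOF
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 --reassignment-json-file reassign.json --execute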

Now I cannot undo adding 155 to the replica list - I ran
kafka-reassign-partitions.sh with the original replica assignment of the
partition, and running --verify now returns:

Status of partition reassignment:
ERROR: Assigned replicas (135,163,68,155) don't match the list of replicas
for reassignment (135,163,68) for partition [product_templates,0]
Reassignment of partition [product_templates,0] failed
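
The verify step, and a quick way to see the reassignment still registered in
ZooKeeper, look roughly like this (address and file name illustrative;
/admin/reassign_partitions is the znode Kafka uses for in-flight reassignments):

bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 --reassignment-json-file reassign-original.json --verify
# any reassignment that never completed is still visible here:
bin/zookeeper-shell.sh zk1:2181 get /admin/reassign_partitions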

Why would this fail?

Thanks for looking!
S.


On Mon, Oct 19, 2015 at 9:52 PM, Szymon Sobczak <sz...@getbase.com>
wrote:

> Hi!
>
> We're running a 5-machine production Kafka cluster on version 0.8.1.1.
> Yesterday we had some disk problems on one of the replicas and decided to
> replace that node with a clean one. That's when we started experiencing
> many different problems:
>
> - partition replicas are still assigned to the old node and we can't
> remove it from the replica list
> - replicas are lagging behind, most of the topics have only one ISR
> - most of the leaders are on a single node
> - CPU load on the machines is constantly high
>
> We've tried to rebalance the cluster by moving the leaders, decreasing the
> number of replicas, and a few other things, but it doesn't seem to help. In
> the meantime I've noticed very weird errors in kafka.log:
>
> For partition 0 of topic product_templates with the following description:
>
> Topic:product_templates PartitionCount:2 ReplicationFactor:3 Configs:
> Topic: product_templates Partition: 0 Leader: 135 Replicas: 135,163,68 Isr:
> 135,68,163
> Topic: product_templates Partition: 1 Leader: 155 Replicas: 163,68,164 Isr:
> 155,68,164
>
> On machine 135 (which is a leader of product_templates,0) in kafka.log I
> see:
>
> kafka.common.NotAssignedReplicaException: Leader 135 failed to record
> follower 155's position 0 for partition [product_templates,0] since the
> replica 155 is not recognized to be one of the assigned replicas 68,163,135
> for partition [product_templates,0]
>
> And the complementary error, on 155 - NOT a replica of product_templates,0:
>
> ERROR [ReplicaFetcherThread-0-135] 2015-10-20 04:41:47,011 Logging.scala
> kafka.server.ReplicaFetcherThread [ReplicaFetcherThread-0-135], Error for
> partition [product_templates,0] to broker 135:class
> kafka.common.UnknownException
>
> Both of those happen for multiple topics, on multiple machines. Every
> single one happens multiple times per second...
>
> How to approach this? Any help is appreciated!
>
> Thanks!
> Szymon.
>