Posted to users@kafka.apache.org by James Brown <jb...@easypost.com> on 2017/04/28 17:43:47 UTC

topics stuck in "Leader: -1" after crash while migrating topics

We're running 0.10.1.0 on a five-node cluster.

I was in the process of migrating some topics from having two replicas to
having three replicas when two of the five machines in this cluster crashed
(brokers 2 and 3).

After restarting them, all of the topics that were previously assigned to
them are unavailable and showing "Leader: -1".

Example kafka-topics output:

% kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:

Note that I wasn't even moving any of the __consumer_offsets partitions,
so I'm not sure if the fact that a reassignment was in progress is a red
herring or not.

The logs are full of

ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
server experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
server experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition
[epostg.request_log_v1,0] to broker
3:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition
[epostg.request_log_v1,0] to broker
3:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)

What can I do to fix this? Should I manually reassign every partition that
was led by broker 2 or 3 so that its replica set contains only the surviving
third broker from its original replica set? Do I need to temporarily enable
unclean elections?
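
(For concreteness, here's roughly what I mean, using tracking.syslog partition
2 and surviving broker 4 as illustrative placeholders. A reassignment JSON
that shrinks the replica set to just the surviving broker would look
something like:

% cat shrink.json
{"version": 1,
 "partitions": [{"topic": "tracking.syslog", "partition": 2, "replicas": [4]}]}
% kafka-reassign-partitions.sh --zookeeper $ZK_HP \
    --reassignment-json-file shrink.json --execute

And for unclean elections, I assume it would be the per-topic override:

% kafka-configs.sh --zookeeper $ZK_HP --alter --entity-type topics \
    --entity-name tracking.syslog --add-config unclean.leader.election.enable=true

Though I'm not sure either is safe to run while a reassignment is still in
flight.)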

I've never seen a cluster fail this way...

-- 
James Brown
Engineer

Re: topics stuck in "Leader: -1" after crash while migrating topics

Posted by Ismael Juma <is...@juma.me.uk>.
There are indeed some known issues in the Controller that require care to
avoid. Onur has recently contributed a PR that simplifies the concurrency
model of the Controller:

https://github.com/apache/kafka/commit/bb663d04febcadd4f120e0ff5c5919ca8bf7e971

This is a good first step and will be part of 0.11.0.0. The next step will
be to fix the session expiration issues. It's a non-trivial amount of work
so the current target is the feature release after 0.11.0.0.

Ismael

On Fri, Apr 28, 2017 at 8:30 PM, Michal Borowiecki <
michal.borowiecki@openbet.com> wrote:

> Hi James,
>
> This "Cached zkVersion [x] not equal to that in zookeeper" issue bit us
> once in production and I found these tickets to be relevant:
> KAFKA-2729 <https://issues.apache.org/jira/browse/KAFKA-2729>
> KAFKA-3042 <https://issues.apache.org/jira/browse/KAFKA-3042>
> KAFKA-3083 <https://issues.apache.org/jira/browse/KAFKA-3083>
> Unfortunately, I don't believe there is a fix for it yet, or one in the making.
>
> Thanks,
> Michał
>
>
> On 28/04/17 19:26, James Brown wrote:
>
> For what it's worth, shutting down the entire cluster and then restarting
> it did address this issue.
>
> I'd love anyone's thoughts on what the "correct" fix would be here.
>
> On Fri, Apr 28, 2017 at 10:58 AM, James Brown <jb...@easypost.com> wrote:
>
>
> The following is also appearing in the logs a lot, if anyone has any ideas:
>
> INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jb...@easypost.com> wrote:
>
>
> We're running 0.10.1.0 on a five-node cluster.
>
> I was in the process of migrating some topics from having two replicas to
> having three replicas when two of the five machines in this cluster crashed
> (brokers 2 and 3).
>
> After restarting them, all of the topics that were previously assigned to
> them are unavailable and showing "Leader: -1".
>
> Example kafka-topics output:
>
> % kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>
> Note that I wasn't even moving any of the __consumer_offsets partitions,
> so I'm not sure if the fact that a reassignment was in progress is a red
> herring or not.
>
> The logs are full of
>
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
>
> What can I do to fix this? Should I manually reassign every partition that
> was led by broker 2 or 3 so that its replica set contains only the surviving
> third broker from its original replica set? Do I need to temporarily enable
> unclean elections?
>
> I've never seen a cluster fail this way...
>
> --
> James Brown
> Engineer
>
>
> --
> James Brown
> Engineer
>
>
>
> --
> Michał Borowiecki
> Senior Software Engineer L4
> OpenBet Ltd
>

Re: topics stuck in "Leader: -1" after crash while migrating topics

Posted by Michal Borowiecki <mi...@openbet.com>.
Hi James,

This "Cached zkVersion [x] not equal to that in zookeeper" issue bit us 
once in production and I found these tickets to be relevant:
KAFKA-2729 <https://issues.apache.org/jira/browse/KAFKA-2729>
KAFKA-3042 <https://issues.apache.org/jira/browse/KAFKA-3042>
KAFKA-3083 <https://issues.apache.org/jira/browse/KAFKA-3083>

Unfortunately, I don't believe there is a fix for it yet, or one in the making.

Thanks,
Michał

On 28/04/17 19:26, James Brown wrote:
> For what it's worth, shutting down the entire cluster and then restarting
> it did address this issue.
>
> I'd love anyone's thoughts on what the "correct" fix would be here.
>
> On Fri, Apr 28, 2017 at 10:58 AM, James Brown <jb...@easypost.com> wrote:
>
>> The following is also appearing in the logs a lot, if anyone has any ideas:
>>
>> INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
>> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>>
>> On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jb...@easypost.com> wrote:
>>
>>> We're running 0.10.1.0 on a five-node cluster.
>>>
>>> I was in the process of migrating some topics from having two replicas to
>>> having three replicas when two of the five machines in this cluster crashed
>>> (brokers 2 and 3).
>>>
>>> After restarting them, all of the topics that were previously assigned to
>>> them are unavailable and showing "Leader: -1".
>>>
>>> Example kafka-topics output:
>>>
>>> % kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
>>> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
>>> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
>>> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
>>> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
>>> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
>>> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
>>> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
>>> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
>>> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
>>> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
>>> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
>>> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
>>> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>>>
>>> Note that I wasn't even moving any of the __consumer_offsets partitions,
>>> so I'm not sure if the fact that a reassignment was in progress is a red
>>> herring or not.
>>>
>>> The logs are full of
>>>
>>> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
>>> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
>>> server experienced an unexpected error when processing the request
>>> (kafka.server.ReplicaFetcherThread)
>>> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
>>> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
>>> server experienced an unexpected error when processing the request
>>> (kafka.server.ReplicaFetcherThread)
>>> ERROR [ReplicaFetcherThread-0-3], Error for partition
>>> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
>>> The server experienced an unexpected error when processing the request
>>> (kafka.server.ReplicaFetcherThread)
>>> ERROR [ReplicaFetcherThread-0-3], Error for partition
>>> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
>>> The server experienced an unexpected error when processing the request
>>> (kafka.server.ReplicaFetcherThread)
>>>
>>> What can I do to fix this? Should I manually reassign every partition that
>>> was led by broker 2 or 3 so that its replica set contains only the surviving
>>> third broker from its original replica set? Do I need to temporarily enable
>>> unclean elections?
>>>
>>> I've never seen a cluster fail this way...
>>>
>>> --
>>> James Brown
>>> Engineer
>>>
>>
>>
>> --
>> James Brown
>> Engineer
>>
>
>

-- 
Michał Borowiecki
Senior Software Engineer L4
OpenBet Ltd


Re: topics stuck in "Leader: -1" after crash while migrating topics

Posted by James Brown <jb...@easypost.com>.
For what it's worth, shutting down the entire cluster and then restarting
it did address this issue.

I'd love anyone's thoughts on what the "correct" fix would be here.
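
(One less drastic option I've seen suggested for this kind of stuck-controller
state, rather than bouncing the whole cluster, is to force a controller
re-election by deleting the /controller znode in ZooKeeper. Sketch only; I
haven't verified it against this exact failure:

% zookeeper-shell.sh $ZK_HP get /controller
% zookeeper-shell.sh $ZK_HP delete /controller

The first command shows which broker currently claims the controller role; the
delete makes the brokers race to re-register, and the newly elected controller
re-propagates leader/ISR state to every broker, which can clear stale cached
state without a full restart.)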

On Fri, Apr 28, 2017 at 10:58 AM, James Brown <jb...@easypost.com> wrote:

> The following is also appearing in the logs a lot, if anyone has any ideas:
>
> INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jb...@easypost.com> wrote:
>
>> We're running 0.10.1.0 on a five-node cluster.
>>
>> I was in the process of migrating some topics from having two replicas to
>> having three replicas when two of the five machines in this cluster crashed
>> (brokers 2 and 3).
>>
>> After restarting them, all of the topics that were previously assigned to
>> them are unavailable and showing "Leader: -1".
>>
>> Example kafka-topics output:
>>
>> % kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
>> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
>> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
>> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
>> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
>> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
>> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
>> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
>> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
>> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
>> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
>> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
>> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
>> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>>
>> Note that I wasn't even moving any of the __consumer_offsets partitions,
>> so I'm not sure if the fact that a reassignment was in progress is a red
>> herring or not.
>>
>> The logs are full of
>>
>> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
>> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
>> server experienced an unexpected error when processing the request
>> (kafka.server.ReplicaFetcherThread)
>> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
>> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
>> server experienced an unexpected error when processing the request
>> (kafka.server.ReplicaFetcherThread)
>> ERROR [ReplicaFetcherThread-0-3], Error for partition
>> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
>> The server experienced an unexpected error when processing the request
>> (kafka.server.ReplicaFetcherThread)
>> ERROR [ReplicaFetcherThread-0-3], Error for partition
>> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
>> The server experienced an unexpected error when processing the request
>> (kafka.server.ReplicaFetcherThread)
>>
>> What can I do to fix this? Should I manually reassign every partition that
>> was led by broker 2 or 3 so that its replica set contains only the surviving
>> third broker from its original replica set? Do I need to temporarily enable
>> unclean elections?
>>
>> I've never seen a cluster fail this way...
>>
>> --
>> James Brown
>> Engineer
>>
>
>
>
> --
> James Brown
> Engineer
>



-- 
James Brown
Engineer

Re: topics stuck in "Leader: -1" after crash while migrating topics

Posted by James Brown <jb...@easypost.com>.
The following is also appearing in the logs a lot, if anyone has any ideas:

INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
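
(In case it helps anyone diagnose: the partition state that ZooKeeper actually
holds can be read directly and compared against the broker's cached view. A
sketch, using the standard Kafka ZooKeeper layout:

% zookeeper-shell.sh $ZK_HP get /brokers/topics/easypost.syslog/partitions/7/state

This prints a small JSON blob with the leader, leader_epoch and isr, plus the
znode stat; the stat's dataVersion is the zkVersion the broker's cache is
being compared against.)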

On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jb...@easypost.com> wrote:

> We're running 0.10.1.0 on a five-node cluster.
>
> I was in the process of migrating some topics from having two replicas to
> having three replicas when two of the five machines in this cluster crashed
> (brokers 2 and 3).
>
> After restarting them, all of the topics that were previously assigned to
> them are unavailable and showing "Leader: -1".
>
> Example kafka-topics output:
>
> % kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>
> Note that I wasn't even moving any of the __consumer_offsets partitions,
> so I'm not sure if the fact that a reassignment was in progress is a red
> herring or not.
>
> The logs are full of
>
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
>
> What can I do to fix this? Should I manually reassign every partition that
> was led by broker 2 or 3 so that its replica set contains only the surviving
> third broker from its original replica set? Do I need to temporarily enable
> unclean elections?
>
> I've never seen a cluster fail this way...
>
> --
> James Brown
> Engineer
>



-- 
James Brown
Engineer