Posted to dev@kafka.apache.org by Rajasekar Elango <re...@salesforce.com> on 2015/07/30 17:14:37 UTC

ReplicaManager recovery after losing a broker

We run a 5-node Kafka cluster in production with a replication factor of 3.
If we lose a broker for a couple of days and its kafka-data directory is
wiped when it comes back online, we have to do a rolling restart of all
brokers to make them healthy.

It mostly recovers on its own: FailedFetchRequests and
UnderReplicatedPartitions decrease slowly after the failed broker comes
back online. But after some time UnderReplicatedPartitions flattens out for
2 brokers and never drops to zero. When I checked the broker logs, I saw
this exception:

2015-07-29 02:15:57,289 [kafka-request-handler-5] ERROR
(kafka.server.ReplicaManager)  - [Replica Manager on Broker 4]: Error when
processing fetch request for partition
[com.salesforce.mandm.ajna.Metric.puppet.system,7] offset 5627 from
follower with correlation id 2425050. Possible cause: Request for offset
5627 but we only have log segments in the range 5808 to 5985.

2015-07-29 02:15:57,289 [kafka-network-thread-6667-3] ERROR
(kafka.network.Processor)  - Closing socket for kafka-broker-host1 because
of error

kafka.common.KafkaException: This operation cannot be completed on a
complete request.

kafka-broker-host1 is the failed broker that came back online.
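
If I am reading the first error right, the follower is asking for offset
5627 while the leader's earliest retained segment now starts at 5808,
presumably because retention deleted the older segments while the broker
was down. Roughly, the leader's validity check behaves like the sketch
below (a simplified illustration with invented names, not Kafka's actual
code):

// Simplified illustration of the out-of-range check a leader applies to a
// follower fetch. Names are invented for clarity; this is not Kafka's code.
public class FetchRangeCheck {

    static class OffsetOutOfRangeException extends RuntimeException {
        OffsetOutOfRangeException(String message) { super(message); }
    }

    private final long logStartOffset; // earliest offset still on disk
    private final long logEndOffset;   // latest offset on disk

    FetchRangeCheck(long logStartOffset, long logEndOffset) {
        this.logStartOffset = logStartOffset;
        this.logEndOffset = logEndOffset;
    }

    void validateFetchOffset(long requestedOffset) {
        // A fetch below the log start (segments already deleted by
        // retention) or beyond the log end is rejected; the message
        // mirrors what the broker logged above.
        if (requestedOffset < logStartOffset || requestedOffset > logEndOffset) {
            throw new OffsetOutOfRangeException(
                "Request for offset " + requestedOffset
                + " but we only have log segments in the range "
                + logStartOffset + " to " + logEndOffset + ".");
        }
    }

    public static void main(String[] args) {
        // The numbers from our log line: segments 5808..5985, fetch at 5627.
        new FetchRangeCheck(5808, 5985).validateFetchOffset(5627);
    }
}

As far as I understand, a follower that hits an out-of-range error is
supposed to reset its fetch offset into the leader's range and carry on, so
the puzzling part is why these 2 brokers stay stuck until we roll them.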

Is this a bug or expected behavior? Are we supposed to always do a rolling
restart if the kafka-data dir on one broker is wiped?

BTW, we did not see any impact on producers or consumers; we only lost some
replication until the rolling restart was done.
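
For reference, here is a minimal sketch of how one could poll the
UnderReplicatedPartitions gauge per broker over JMX to watch it drain to
zero (this assumes JMX is enabled on the brokers at port 9999, which is not
a default; the MBean name below matches 0.8.2-style metric names and may
differ on other versions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Read the ReplicaManager UnderReplicatedPartitions gauge from one broker.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "kafka-broker-host1";
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Object value = conn.getAttribute(gauge, "Value");
            System.out.println(host + " UnderReplicatedPartitions = " + value);
        }
    }
}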
-- 
Thanks,
Raja.