Posted to jira@kafka.apache.org by "Aravind Velamur Srinivasan (JIRA)" <ji...@apache.org> on 2017/12/15 21:35:00 UTC

[jira] [Created] (KAFKA-6374) Constant Consumer Errors after replacing a broker

Aravind Velamur Srinivasan created KAFKA-6374:
-------------------------------------------------

             Summary: Constant Consumer Errors after replacing a broker
                 Key: KAFKA-6374
                 URL: https://issues.apache.org/jira/browse/KAFKA-6374
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.2.1
         Environment: OS: linux
Broker Instances: EC2 (r4.xlarge)
Storage: EBS (HDD st1 - 16T size)
Client: golang (sarama and sarama-cluster libraries)
Cluster Size: 5 nodes
Kafka Version: 0.10.2.1
ZooKeeper: 3 nodes (separate from the brokers)
            Reporter: Aravind Velamur Srinivasan


We had to replace one of the brokers for maintenance reasons. We did the following to replace the broker:
(1) Gracefully stop the Kafka broker (id: 48)
(2) Make sure producers/consumers were fine (the consumer groups coordinated by this broker were taken over by another broker and kept working)
(3) Spin up a new instance with the same IP
(4) Make sure the new instance's config is the same as the old one's, including the same broker ID.
(5) Bring the new broker up.

This took about 35 to 40 minutes. Once the broker came back up, the consumer groups coordinated by this broker constantly received errors saying that the group is not coordinated by this broker. This went on for nearly 30 to 40 minutes, until I stopped the broker again.

It looks like the metadata kept reporting the same old broker (id 48) as the coordinator for this consumer group, even though the client kept asking for the coordinator.

(1) Are there any known issues/recent fixes for this?
(2) Why didn't the metadata refresh? Any ideas on what could be happening?

When trying to fetch the coordinator, we constantly got errors like this:
'
sarama-logger : client/coordinator requesting coordinator for consumergroup from  (some other broker)
sarama-logger : client/coordinator coordinator for consumergroup is #48
kafka- Error: kafka server: Request was for a consumer group that is not coordinated by this broker.
'

In the Kafka broker logs, we saw lots of errors like this:
'
[2017-12-13 00:38:49,559] ERROR [ReplicaFetcherThread-0-48], Error for partition [__consumer_offsets,37] to broker 48:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
'
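
A group's coordinator is the leader of the __consumer_offsets partition that the group ID hashes to, so it may help to check whether the affected groups map to the partition named in the broker error (37 above). The following is a minimal Go sketch of that mapping, not taken from the broker code itself. It assumes the broker computes abs(groupId.hashCode) modulo the offsets topic partition count, and that the cluster uses the default offsets.topic.num.partitions of 50. The function names and the group ID "my-consumer-group" are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// javaHashCode mirrors java.lang.String.hashCode: a base-31 polynomial over
// the string's UTF-16 code units, with 32-bit wrap-around arithmetic.
func javaHashCode(s string) int32 {
	var h int32
	for _, u := range utf16.Encode([]rune(s)) {
		h = 31*h + int32(u)
	}
	return h
}

// offsetsPartitionFor returns the __consumer_offsets partition a group ID is
// assumed to map to: abs(hash) % numPartitions, with the minimum int32 value
// treated as 0 so the result is never negative.
func offsetsPartitionFor(groupID string, numPartitions int32) int32 {
	h := javaHashCode(groupID)
	if h == -1<<31 {
		return 0
	}
	if h < 0 {
		h = -h
	}
	return h % numPartitions
}

func main() {
	// Hypothetical group ID; 50 is the default offsets.topic.num.partitions
	// and must be replaced if the cluster overrides it.
	group := "my-consumer-group"
	fmt.Printf("group %q -> __consumer_offsets partition %d\n",
		group, offsetsPartitionFor(group, 50))
}
```

If the affected groups do map to that partition, the leader reported for it (for example by kafka-topics.sh --describe on __consumer_offsets) can be compared with the coordinator the clients are being told to use.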

Is this a stale-metadata problem like the one described here?
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design#Kafka0.9ConsumerRewriteDesign-Co-ordinatorfailoverorconnectionlosstotheco-ordinator



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)