Posted to users@kafka.apache.org by Anish Mashankar <an...@systeminsights.com> on 2017/03/22 05:43:43 UTC

Kafka Connect behaves weird in case of zombie Kafka brokers. Also, zombie brokers?

Hello everyone,
We are running a 5-broker Kafka v0.10.0.0 cluster on AWS; the Connect API
is also on v0.10.0.0.
We observed that the distributed Kafka Connect worker went into an
infinite loop of the log message

(Re-)joining group connect-connect-elasticsearch-indexer.

After a little more digging, we found another infinite loop of the
following pair of log messages:

Discovered coordinator 1.2.3.4:9092 (id: ____ rack: null) for group x

Marking the coordinator 1.2.3.4:9092 (id: ____ rack: null) dead for group x

Restarting Kafka connect did not help.

Looking at ZooKeeper, we realized that broker 1.2.3.4 had left the
ZooKeeper cluster after a session timeout. That broker was also the
controller. Controller failover happened and the applications were fine,
but a few days later we started facing the issue described above. To add
to the surprise, the Kafka daemon was still running on that host but was
no longer trying to contact ZooKeeper at all. Hence, zombie broker.

Also, a Connect cluster spread across multiple hosts was not working;
however, a single connector worked.

After replacing the EC2 instance for broker 1.2.3.4 (keeping the same
broker ID), the Kafka Connect cluster started working fine.

I am not sure if this is a bug. Kafka Connect shouldn't keep trying the
same broker if it is not able to establish a connection. We use Consul for
service discovery. Because the broker process was running and port 9092
was open, Consul reported it as healthy and resolved DNS queries to that
broker when the connector tried to connect. We have disabled DNS caching
in the Java options, so Kafka Connect should have tried a different host
on each lookup, but instead it only ever tried broker 1.2.3.4.
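
For reference, disabling the JVM address cache looks roughly like this (a
sketch with illustrative values, not our exact options):

  // Sketch: turn off the JVM-level address cache so each broker hostname
  // lookup goes back to the resolver (Consul, in our case).
  import java.security.Security;

  public class DisableJvmDnsCache {
      public static void main(String[] args) {
          // Must run before the first hostname lookup in the JVM.
          Security.setProperty("networkaddress.cache.ttl", "0");          // don't cache successful lookups
          Security.setProperty("networkaddress.cache.negative.ttl", "0"); // don't cache failed lookups
          // The legacy -Dsun.net.inetaddr.ttl=0 system property has a
          // similar effect on some JVMs when passed on the command line.
      }
  }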

Does Kafka Connect cache DNS internally? Is there a debugging step I am
missing here?
-- 
Anish Samir Mashankar

Re: Kafka Connect behaves weird in case of zombie Kafka brokers. Also, zombie brokers?

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Is that node the only bootstrap broker provided? If the Connect worker was
pinned to *only* that broker, it wouldn't have any chance of recovering
correct cluster information from the healthy brokers.
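
As a sanity check, you can confirm that metadata is reachable through a
multi-broker bootstrap list with the plain Java consumer (a sketch with
placeholder hostnames, not a Connect API):

  // Sketch: force a metadata fetch against several bootstrap brokers
  // (placeholder hostnames; adjust to your cluster).
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.ByteArrayDeserializer;

  public class BootstrapCheck {
      public static void main(String[] args) {
          Properties props = new Properties();
          // List several brokers so discovery doesn't depend on any single host.
          props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
          props.put("key.deserializer", ByteArrayDeserializer.class.getName());
          props.put("value.deserializer", ByteArrayDeserializer.class.getName());
          try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
              // listTopics() triggers a metadata request via the bootstrap list.
              System.out.println("Topics visible: " + consumer.listTopics().keySet());
          }
      }
  }

If the worker's bootstrap.servers only lists the node that went bad, the
same pinning would show up in Connect itself.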

It sounds like there was a separate problem as well (the broker should have
figured out it was in a bad state wrt ZK), but normally we would expect
Connect to detect the issue, mark the coordinator dead (as it did) and then
refresh metadata to figure out which broker it should be talking to now.
There are some known edge cases around how initial cluster discovery works
which *might* be able to get you stuck in this situation.

-Ewen
