Posted to users@kafka.apache.org by Hemanth Savasere <he...@gmail.com> on 2022/10/13 01:27:11 UTC
Entire Kafka Connect cluster stuck because of a stuck sink connector
We have stumbled upon an issue on a running cluster with multiple
source/sink connectors:
1. One of our connectors was a JDBC sink connector connected to an SQL
Server database (using the Oracle JDBC driver).
2. It turned out that the DB instance had a problem that caused all queries
to hang indefinitely, which in turn made the connector's start method
hang indefinitely.
3. After some time, the entire Kafka Connect cluster became unavailable and
the REST API stopped responding, returning
{"error_code":500,"message":"Request timed out"} for most requests.
4. Pausing (just before the deletion of the consumer group) or deleting
the problematic connector allowed the cluster to run normally again.
We could reproduce the same issue by adding Thread.sleep(300000) in the
start method or in the put method of the ConnectorTask.
Is there any wiki page or documentation that describes how to handle this
issue? My approach would be to throw a timeout exception after waiting for
a fixed period, so that the connector fails fast instead of hanging.
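
A minimal sketch of the fail-fast idea, in plain Java with no Kafka
dependency. The class name FailFast and the helper callWithTimeout are
hypothetical and are not part of Kafka Connect; the sketch only assumes
that the blocking work (for example the database call made from start or
put) can be wrapped in a Callable. In a real task the timeout would more
likely be surfaced as an org.apache.kafka.connect.errors.ConnectException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFast {

    // Hypothetical helper (not a Kafka Connect API): runs op on a separate
    // daemon thread and gives up once timeoutMs has elapsed.
    static <T> T callWithTimeout(Callable<T> op, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fail-fast-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(op);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the blocked call
            // actually responds to interruption.
            future.cancel(true);
            throw new IllegalStateException(
                    "Operation did not finish within " + timeoutMs + " ms", e);
        } catch (ExecutionException e) {
            // Rethrow the original failure from the wrapped operation.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates a hung call, similar to the Thread.sleep reproduction.
        try {
            callWithTimeout(() -> {
                Thread.sleep(5_000);
                return "never returned";
            }, 200);
        } catch (IllegalStateException e) {
            System.out.println("Failed fast: " + e.getMessage());
        }
    }
}
```

One caveat with this approach: interrupting the worker thread does not
guarantee that a blocked JDBC call returns, so driver-level limits such as
java.sql.Statement.setQueryTimeout or DriverManager.setLoginTimeout may
still be needed alongside it.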
--
Thanks & Regards,
Hemanth
Re: Entire Kafka Connect cluster stuck because of a stuck sink connector
Posted by Chris Egerton <fe...@gmail.com>.
Hi,
What version of Kafka Connect are you running? This sounds like a bug that
was fixed a few releases ago.
Cheers,
Chris