Posted to users@kafka.apache.org by Hemanth Savasere <he...@gmail.com> on 2022/10/13 01:27:11 UTC

Entire Kafka Connect cluster stuck because of a stuck sink connector

We have stumbled upon an issue on a running cluster with multiple
source/sink connectors:

   1. One of our connectors was a JDBC sink connector connected to an SQL
   Server database (using the Oracle JDBC driver).
   2. It turns out that the DB instance had a problem causing all queries
   to be stuck forever, which in turn made the start method of the connector
   hang forever.
   3. After some time, the entire Kafka Connect cluster became unavailable and
   the REST API stopped responding, returning
   {"error_code":500,"message":"Request timed out"} for most requests.
   4. Pausing (just before the deletion of the consumer group) or deleting
   the problematic connector allowed the cluster to run normally again.

We could reproduce the same issue by adding Thread.sleep(300000) in the
start method or in the put method of the ConnectorTask.

Is there any wiki page or documentation that describes how to handle this
issue? My approach would be to time out after waiting for a fixed period, so
that the connector fails fast instead of hanging.
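
A minimal sketch of that fail-fast idea, using only the Java standard
library. TimeBoxed and its call method are hypothetical names for
illustration and are not part of Kafka Connect; the blocking call (for
example, opening the JDBC connection) runs on a separate thread and the
caller gives up after a deadline. In a real task one would probably throw
org.apache.kafka.connect.errors.ConnectException rather than
IllegalStateException.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Kafka Connect API): runs a blocking call with a deadline. */
final class TimeBoxed {
    private TimeBoxed() {}

    static <T> T call(Callable<T> blockingCall, long timeout, TimeUnit unit) {
        // Daemon worker thread, so a call that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-boxed-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(blockingCall);
        try {
            // Wait at most `timeout` for the blocking call to finish.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the call honours interruption.
            future.cancel(true);
            throw new IllegalStateException(
                    "Call did not finish within " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            // The call itself failed; surface its original cause.
            throw new IllegalStateException("Call failed", e.getCause());
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IllegalStateException("Interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A task's start method could then wrap its connection setup, e.g.
TimeBoxed.call(() -> openConnection(), 30, TimeUnit.SECONDS), where
openConnection is a placeholder for whatever blocks. Some JDBC calls may
not react to thread interruption, so the stuck worker thread can linger
even though the caller has moved on; driver-level login and query
timeouts are likely still worth setting alongside this.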

-- 
Thanks & Regards,
Hemanth

Re: Entire Kafka Connect cluster stuck because of a stuck sink connector

Posted by Chris Egerton <fe...@gmail.com>.
Hi,

What version of Kafka Connect are you running? This sounds like a bug that
was fixed a few releases ago.

Cheers,

Chris
