Posted to jira@kafka.apache.org by "Ryanne Dolan (Jira)" <ji...@apache.org> on 2021/04/30 20:10:00 UTC
[jira] [Updated] (KAFKA-12726) misbehaving Task.stop() can prevent other Tasks from stopping
[ https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryanne Dolan updated KAFKA-12726:
---------------------------------
Description:
We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck in a retry loop). Despite Connect supporting a property task.shutdown.graceful.timeout.ms, this is currently not enforced – tasks can take as long as they want to stop, and the only consequence is an error message.
We've seen a Worker's "task-count" metric double following a rebalance, which we think is due to Tasks not getting cleaned up when Task.stop() is stuck.
While the Connector implementation is ultimately to blame here – a Task probably shouldn't loop forever in stop() – we believe the Connect runtime should handle this situation more gracefully.
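To illustrate what "more gracefully" could mean, here is a minimal sketch of stopping tasks in parallel with an enforced deadline, so that one stuck stop() cannot hold up the others. It is not the Connect runtime's code or the fix for this issue; BoundedTaskStopper, StoppableTask and stopAll are hypothetical names, and timeoutMs merely stands in for the value of task.shutdown.graceful.timeout.ms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Kafka Connect.
final class BoundedTaskStopper {

    /** Hypothetical stand-in for a Connect task; only stop() matters here. */
    interface StoppableTask {
        void stop();
    }

    /**
     * Calls stop() on every task in its own thread and waits at most
     * timeoutMs in total. Returns the number of tasks still stuck in stop().
     */
    static int stopAll(List<? extends StoppableTask> tasks, long timeoutMs)
            throws InterruptedException {
        if (tasks.isEmpty()) {
            return 0;
        }
        ExecutorService executor = Executors.newFixedThreadPool(tasks.size(), r -> {
            Thread t = new Thread(r, "task-stopper");
            t.setDaemon(true); // a stuck stop() must not keep the JVM alive
            return t;
        });
        List<Future<?>> pending = new ArrayList<>();
        for (StoppableTask task : tasks) {
            Runnable stopCall = task::stop;
            pending.add(executor.submit(stopCall));
        }
        // One shared deadline for all tasks, rather than one timeout per task.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        int stuck = 0;
        for (Future<?> f : pending) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                f.get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the stuck stop() and move on
                stuck++;
            } catch (ExecutionException e) {
                // stop() threw; treat the task as stopped for this sketch
            }
        }
        executor.shutdownNow();
        return stuck;
    }
}
```

The sketch only bounds how long the caller waits. A stop() that ignores interrupts keeps running on its daemon thread, so a real runtime would still need to decide how to account for such a task.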
was:
We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck in a retry loop). Despite Connect supporting a property task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks can take as long as they want to stop, and the only consequence is an error message.
Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can prevent any further Tasks from stopping. Moreover, after a rebalance, these lingering tasks can persist along with their replacements. For example, we've seen a Worker's "task-count" metric double following a rebalance.
While the Connector implementation is ultimately to blame here -- a Task probably shouldn't loop forever in stop() -- we believe the Connect runtime should handle this situation more gracefully.
> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
> Key: KAFKA-12726
> URL: https://issues.apache.org/jira/browse/KAFKA-12726
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 2.8.0
> Reporter: Ryanne Dolan
> Assignee: Ryanne Dolan
> Priority: Minor