You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Chris Egerton (Jira)" <ji...@apache.org> on 2019/10/16 18:06:00 UTC

[jira] [Created] (KAFKA-9051) Source task source offset reads can block graceful shutdown

Chris Egerton created KAFKA-9051:
------------------------------------

             Summary: Source task source offset reads can block graceful shutdown
                 Key: KAFKA-9051
                 URL: https://issues.apache.org/jira/browse/KAFKA-9051
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
    Affects Versions: 2.2.1, 2.3.0, 2.1.1, 2.0.1, 1.1.1, 1.0.2, 2.4.0, 2.5.0
            Reporter: Chris Egerton
            Assignee: Chris Egerton


When source tasks request source offsets from the framework, this results in a call to [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79] with no timeout. In distributed workers, the future is blocked on a successful [read to the end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136] of the source offsets topic, which in turn will [poll that topic indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287] until the latest messages for every partition of that topic have been consumed.

This normally completes in a reasonable amount of time. However, if the connectivity between the Connect worker and the Kafka cluster is degraded or dropped in the middle of one of these reads, it will block until connectivity is restored and the request completes successfully.

If a task is stopped (due to a manual restart via the REST API, a rebalance, worker shutdown, etc.) while blocked on a read of source offsets during its {{start}} method, not only will it fail to gracefully stop, but the framework [will not even invoke its stop method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183] until its {{start}} method (and, as a result, the source offset read request) [has completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206]. This prevents the task from being able to clean up any resources it has allocated and can lead to OOM errors, excessive thread creation, and other problems.

 

I've confirmed that this affects every release of Connect back through 1.0 at least; I've tagged the most recent bug fix release of every major/minor version from then on in the {{Affects Version/s}} field to avoid just putting every version in that field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)