Posted to jira@kafka.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/10/16 18:13:00 UTC

[jira] [Commented] (KAFKA-9051) Source task source offset reads can block graceful shutdown

    [ https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953068#comment-16953068 ] 

ASF GitHub Bot commented on KAFKA-9051:
---------------------------------------

C0urante commented on pull request #7532: KAFKA-9051: Prematurely complete source offset read requests for stopped tasks
URL: https://github.com/apache/kafka/pull/7532
 
 
   [Jira](https://issues.apache.org/jira/browse/KAFKA-9051)
   
   The changes here cause source tasks that are blocked on source offset read requests to become unblocked immediately when they are scheduled for shutdown. This should allow them to complete their `start` method (if they are blocked inside it), which in turn should allow the framework to safely invoke their `stop` method so that they can clean up allocated resources.
   
   The source offsets returned to the task in this case may be either stale or missing entirely; however, this seems preferable to throwing an exception and potentially corrupting the state of the task badly enough that it is then unable to clean up resources in its `stop` method due to, e.g., NPEs.
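
As a rough illustration of the idea (not the code in this PR), the sketch below shows how a blocked, untimed offset read can be released by completing its future early on shutdown. `CancellableOffsetReader`, `readCompleted`, `close`, and `OffsetReadDemo` are hypothetical names invented for this example, and the empty map stands in for the "missing" offsets case described above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical stand-in for an offset reader; this is NOT the actual
 * Connect class, only a model of the blocking behavior.
 */
class CancellableOffsetReader {
    // Completed by the (simulated) backing store once the read finishes.
    private final CompletableFuture<Map<String, Long>> pendingRead = new CompletableFuture<>();

    /** Blocks with no timeout until the read completes or close() is called. */
    Map<String, Long> offsets() {
        // join() waits indefinitely, like an untimed Future.get(),
        // but without checked exceptions.
        return pendingRead.join();
    }

    /** Simulates the backing store delivering the offsets it read. */
    void readCompleted(Map<String, Long> offsets) {
        pendingRead.complete(offsets);
    }

    /**
     * Called when the task is scheduled for shutdown. Completes the pending
     * read with an empty result so that any blocked caller returns at once.
     * This is a no-op if the read has already completed.
     */
    void close() {
        pendingRead.complete(Collections.emptyMap());
    }
}

class OffsetReadDemo {
    public static void main(String[] args) throws InterruptedException {
        CancellableOffsetReader reader = new CancellableOffsetReader();

        // Simulates a task whose start() method requests source offsets.
        Thread task = new Thread(() -> {
            Map<String, Long> offsets = reader.offsets();
            System.out.println("start() unblocked with offsets: " + offsets);
        });
        task.start();

        // No read ever completes here (think: degraded connectivity), so
        // only the shutdown path can release the task thread.
        reader.close();
        task.join(5000);
        System.out.println("task thread alive: " + task.isAlive());
    }
}
```

Because `complete` has no effect on a future that is already completed, a read that finished before shutdown keeps its real offsets; only reads still in flight see the empty result.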
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 


> Source task source offset reads can block graceful shutdown
> -----------------------------------------------------------
>
>                 Key: KAFKA-9051
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9051
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> When source tasks request source offsets from the framework, this results in a call to [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79] with no timeout. In distributed workers, the future is blocked on a successful [read to the end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136] of the source offsets topic, which in turn will [poll that topic indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287] until the latest messages for every partition of that topic have been consumed.
> This normally completes in a reasonable amount of time. However, if the connectivity between the Connect worker and the Kafka cluster is degraded or dropped in the middle of one of these reads, it will block until connectivity is restored and the request completes successfully.
> If a task is stopped (due to a manual restart via the REST API, a rebalance, worker shutdown, etc.) while blocked on a read of source offsets during its {{start}} method, not only will it fail to gracefully stop, but the framework [will not even invoke its stop method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183] until its {{start}} method (and, as a result, the source offset read request) [has completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206]. This prevents the task from being able to clean up any resources it has allocated and can lead to OOM errors, excessive thread creation, and other problems.
>  
> I've confirmed that this affects every release of Connect back through 1.0 at least; I've tagged the most recent bug fix release of every major/minor version from then on in the {{Affects Version/s}} field to avoid just putting every version in that field.


