You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "John Roesler (Jira)" <ji...@apache.org> on 2020/04/09 20:33:00 UTC

[jira] [Commented] (KAFKA-9846) Race condition can lead to severe lag underestimate for active tasks

    [ https://issues.apache.org/jira/browse/KAFKA-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079987#comment-17079987 ] 

John Roesler commented on KAFKA-9846:
-------------------------------------

Thanks for the report [~ableegoldman] . I'd offer one clarifying comment. The only strict correctness guarantee available is to disable querying stale stores in StoryQueryParameters, which would be unaffected by this race condition, and in fact, such use cases wouldn't even check lags, only the ownership of the active replica.

Additionally, I'd like to add that there's no way to reason strictly about the relationship between the reported lag and the actual lag at the time of a subsequent query. Even if the reported lag is correctly zero, the store may become arbitrarily laggy by the time of a subsequent query.

That said, this bug is clearly a violation of the method's contract, which may overestimate lagginess, but never underestimate it. The fact that we might underestimate by saying that it's fully caught up when it's not even initialized makes it seem that much more egregious.

I've looked into the code base, and I believe that this bug was incidentally fixed by refactoring in trunk, so it would only affect the 2.5 branch.

> Race condition can lead to severe lag underestimate for active tasks
> --------------------------------------------------------------------
>
>                 Key: KAFKA-9846
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9846
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: Sophie Blee-Goldman
>            Priority: Major
>
> In KIP-535 we added the ability to query still-restoring and standby tasks. To give users control over how out of date the data they fetch can be, we added an API to KafkaStreams that fetches the end offsets for all changelog partitions and computes the lag for each local state store.
> During this lag computation, we check whether an active task is in RESTORING and calculate the actual lag if so. If not, we assume it's in RUNNING and return a lag of zero. However, tasks may be in other states besides running and restoring; notably they first pass through the CREATED state before getting to RESTORING. A CREATED task may happen to be caught-up to the end offset, but in many cases it is likely to be lagging or even completely uninitialized.
> This introduces a race condition where users may be led to believe that a task has zero lag and is "safe" to query even with the strictest correctness guarantees, while the task is actually lagging by some unknown amount.  During transfer of ownership of the task between different threads on the same machine, tasks can actually spend a while in CREATED while the new owner waits to acquire the task directory lock. So, this race condition may not be particularly rare in multi-threaded Streams applications



--
This message was sent by Atlassian Jira
(v8.3.4#803005)