You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by "Konstantine Karantasis (Jira)" <ji...@apache.org> on 2021/03/23 06:02:00 UTC

[jira] [Created] (KAFKA-12525) Inaccurate task status due to status record interleaving in fast rebalances in Connect

Konstantine Karantasis created KAFKA-12525:
----------------------------------------------

             Summary: Inaccurate task status due to status record interleaving in fast rebalances in Connect
                 Key: KAFKA-12525
                 URL: https://issues.apache.org/jira/browse/KAFKA-12525
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.6.1, 2.7.0, 2.5.1, 2.4.1, 2.3.1
            Reporter: Konstantine Karantasis
            Assignee: Konstantine Karantasis


When a task is stopped in Connect it produces an {{UNASSIGNED}} status record. 
Equivalently, when a task is started or restarted in Connect it produces an {{RUNNING}} status record in the Connect status topic.

At the same time rebalances are decoupled from task start and stop. These operations happen in separate executor outside of the main worker thread that performs the rebalance.

Normally, any delayed and stale {{UNASSIGNED}} status records are fenced by the worker that is sending them. This worker is using the {{StatusBackingStore#putSafe}} method that will reject any stale status messages (called only for {{UNASSIGNED}} or {{FAILED}}) as long as the worker is aware of the newer status record that declares a task as {{RUNNING}}.

In cases of fast consecutive rebalances where a task is revoked from one worker and assigned to another one, it has been observed that there is a small time window and thus a race condition during which a {{RUNNING}} status record in the new generation is produced and is immediately followed by a delayed {{UNASSIGNED}} status record belonging to the same or a previous generation before the worker that sends this message reads the {{RUNNING}} status record that corresponds to the latest generation.

A couple of options are available to remediate this race condition. 
For example a worker that is has started a task can re-write the {{RUNNING}} status message in the topic if it reads a stale {{UNASSIGNED}} message from a previous generation (that should have been fenced). 
Another option is to ignore stale {{UNASSIGNED}} message (messages from an earlier generation than the one in which the task had {{RUNNING}} status).

Worth noting that when this race condition takes place, besides the inaccurate status representation, the actual execution of the tasks remains unaffected (e.g. the tasks are running correctly even though they appear as {{UNASSIGNED}}). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)