Posted to dev@kafka.apache.org by "Sabit (Jira)" <ji...@apache.org> on 2023/12/18 00:09:00 UTC

[jira] [Created] (KAFKA-16025) Streams StateDirectory has orphaned locks after rebalancing, blocking future rebalancing

Sabit created KAFKA-16025:
-----------------------------

             Summary: Streams StateDirectory has orphaned locks after rebalancing, blocking future rebalancing
                 Key: KAFKA-16025
                 URL: https://issues.apache.org/jira/browse/KAFKA-16025
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.4.0
         Environment: Linux
            Reporter: Sabit
         Attachments: Screenshot 1702750558363.png

Hello,

 

We are encountering an issue where, during rebalancing, stream threads on one client get stuck in the rebalancing state. After enabling debug logs, we saw that some tasks were failing to initialize because they could not grab a lock in the StateDirectory:

 

2023-12-14 22:51:57.352000Z stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since: stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed to lock the state directory for task 0_51; will retry
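To illustrate what "will retry" means for the stuck thread, here is a minimal sketch (not the actual Kafka Streams code; the class name, method names, and retry interval are our own illustrative assumptions) of an initialization step that simply re-attempts the lock on each pass. If the lock owner never releases it, the thread never gets past this step:

{code:java}
import java.util.function.Predicate;

public class LockRetrySketch {
    // Re-attempt the state-directory lock until it succeeds or we give up.
    static boolean initializeWithRetry(String taskId, Predicate<String> tryLock,
                                       int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.test(taskId)) {
                System.out.printf("Task %s initialized on attempt %d%n", taskId, attempt);
                return true;
            }
            System.out.printf("Could not initialize task %s: failed to lock the state directory; will retry%n", taskId);
            Thread.sleep(100); // in reality, re-attempted on a later processing-loop pass
        }
        return false; // with an orphaned lock this is effectively forever
    }

    public static void main(String[] args) throws InterruptedException {
        // A lock that is permanently held by another thread: initialization never succeeds.
        System.out.println("initialized = " + initializeWithRetry("0_51", id -> false, 5));
    }
}
{code}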

 

 

Reading the comments for TaskManager, this seems like an expected case; however, in our situation the thread gets stuck on this permanently. After looking at the logs, we came to understand that whenever tasks 0_51, 0_37, 0_107, 0_219, or 0_93 were assigned to this client, the corresponding threads would get stuck because they could not grab the lock. We took a heap dump of this JVM and found that all of these tasks were locked by StreamThread-21 (see attachment). Additionally, each of these task directories exists on the instance but is an empty directory.
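For context, here is a simplified model of the per-task lock bookkeeping we believe the heap dump is pointing at (an illustrative sketch only, not the real StateDirectory implementation; all names are assumed): each task's state-directory lock is recorded against an owning thread, so an entry left behind for StreamThread-21 refuses every other thread, even though the directory on disk is empty:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskLockSketch {
    private final Map<String, String> taskToOwnerThread = new ConcurrentHashMap<>();

    // Returns true if the calling thread acquires the lock or already owns it.
    public boolean tryLock(String taskId) {
        String me = Thread.currentThread().getName();
        String owner = taskToOwnerThread.putIfAbsent(taskId, me);
        return owner == null || owner.equals(me);
    }

    // Releases the lock only if the calling thread is the recorded owner.
    public void unlock(String taskId) {
        taskToOwnerThread.remove(taskId, Thread.currentThread().getName());
    }

    public static void main(String[] args) throws InterruptedException {
        TaskLockSketch locks = new TaskLockSketch();

        // "StreamThread-21" takes the lock for 0_51 and never releases it,
        // i.e. the orphaned entry the heap dump showed.
        Thread t21 = new Thread(() -> locks.tryLock("0_51"), "StreamThread-21");
        t21.start();
        t21.join();

        // "StreamThread-14" can then never initialize 0_51.
        Thread t14 = new Thread(
                () -> System.out.println("0_51 lock acquired? " + locks.tryLock("0_51")),
                "StreamThread-14");
        t14.start();
        t14.join(); // prints: 0_51 lock acquired? false
    }
}
{code}

Running the main method prints "0_51 lock acquired? false", which matches the behavior we observe for StreamThread-14.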

 

The sequence of events that led to this state is that initially, all of these tasks were being processed on the impacted client, either as active or standby tasks. One client dropped out of the consumer group, so these tasks were rebalanced away from the impacted client. When we tried to bring up a new client to replace the one that dropped out, the impacted client could not initialize the 5 tasks it had originally been processing. Sample timeline for one task:

 

# Task moved away from the original consumer thread
2023-12-13 22:45:58.240000Z stream-thread [i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Suspended running
2023-12-13 22:45:58.263000Z stream-thread [i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Closed clean

# Directory cleaned up?
2023-12-13 22:57:18.696000Z stream-thread [i-0f1a5e7a42158e04b-CleanupThread] Deleting obsolete state directory 0_51 for task 0_51 as 680455ms has elapsed (cleanup delay is 600000ms).

# Cannot initialize the task when it is re-assigned to this client
2023-12-14 22:51:57.352000Z stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since: stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed to lock the state directory for task 0_51; will retry
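For reference, here is a hedged sketch of the kind of cleanup pass the CleanupThread log line above suggests, reusing the TaskLockSketch model from earlier (again illustrative only, not the actual StateDirectory code, and not a claim about the root cause): a task directory is only reclaimed when it has been idle for longer than the cleanup delay (state.cleanup.delay.ms, 600000 ms by default, matching the log) and the cleanup pass can lock it:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanupSketch {
    static void cleanObsoleteTaskDirs(Path stateDir, long cleanupDelayMs,
                                      TaskLockSketch locks) throws IOException {
        long now = System.currentTimeMillis();
        try (Stream<Path> taskDirs = Files.list(stateDir)) {
            for (Path taskDir : (Iterable<Path>) taskDirs::iterator) {
                String taskId = taskDir.getFileName().toString(); // e.g. "0_51"
                long elapsedMs = now - Files.getLastModifiedTime(taskDir).toMillis();
                // Only reclaim directories that are idle past the delay and that
                // this cleanup pass can lock for itself.
                if (elapsedMs > cleanupDelayMs && locks.tryLock(taskId)) {
                    try (Stream<Path> contents = Files.walk(taskDir)) {
                        System.out.printf(
                            "Deleting obsolete state directory %s for task %s as %dms has elapsed (cleanup delay is %dms)%n",
                            taskDir, taskId, elapsedMs, cleanupDelayMs);
                        // Delete children before the directory itself.
                        contents.sorted(Comparator.reverseOrder())
                                .forEach(p -> p.toFile().delete());
                    } finally {
                        locks.unlock(taskId);
                    }
                }
            }
        }
    }
}
{code}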

 

Reading through the StateDirectory code, it wasn't immediately obvious to us how we could arrive at a situation where a task is locked by a thread that no attempt had been made to assign it to, while the task directory was cleaned up but now exists as an empty directory instead of being deleted. We did not observe any filesystem issues on this client around this time either.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)