Posted to dev@kafka.apache.org by "Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2020/07/09 23:06:00 UTC

[jira] [Created] (KAFKA-10262) StateDirectory is not thread-safe

Sophie Blee-Goldman created KAFKA-10262:
-------------------------------------------

             Summary: StateDirectory is not thread-safe
                 Key: KAFKA-10262
                 URL: https://issues.apache.org/jira/browse/KAFKA-10262
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.6.0
            Reporter: Sophie Blee-Goldman


As explicitly stated in the StateDirectory javadocs, "This class is not thread-safe."

Despite this, a single StateDirectory is shared among all the StreamThreads of a client. Some of the more "dangerous" methods are indeed synchronized, but others are not. For example, the innocent-sounding #directoryForTask is not thread-safe and is called in a number of places. We call it during task creation, and we call it during task closure when we check `directoryForTaskIsEmpty`. It's not uncommon for one thread to be closing a task while another is creating it after a rebalance.

In fact, we saw exactly that happen in our test application. This ultimately led to the following exception:

 
{code:java}
org.apache.kafka.streams.errors.ProcessorStateException: task directory [/mnt/run/streams/state/stream-soak-test/1_0] doesn't exist and couldn't be created
    at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:112)
    at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:187)
    at org.apache.kafka.streams.processor.internals.StandbyTaskCreator.createTasks(StandbyTaskCreator.java:85)
    at org.apache.kafka.streams.processor.internals.TaskManager.handleAssignment(TaskManager.java:337)
{code}
 

The exception arises from this line in StateDirectory#directoryForTask:
{code:java}
if (hasPersistentStores && !taskDir.exists() && !taskDir.mkdir()) 
{code}
Presumably, if the taskDir did not exist when the two threads entered this method, both would pass the exists() check and attempt to create the directory. One of them gets there first; the other's mkdir returns false because the directory now exists, and that thread throws the above ProcessorStateException.
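
To make the check-then-act race concrete, here is a minimal standalone sketch. It is not the actual StateDirectory code: the class name TaskDirSketch, both method names, and the use of IllegalStateException in place of ProcessorStateException are assumptions made for illustration. The first method mirrors the condition quoted above; the second shows one possible way to tolerate losing the race, by re-checking whether the directory exists after mkdir fails.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the relevant part of StateDirectory#directoryForTask.
class TaskDirSketch {

    // Mirrors the quoted condition: if another thread creates the directory
    // between exists() and mkdir(), mkdir() returns false and we throw.
    static File racyDirectoryForTask(File taskDir) {
        if (!taskDir.exists() && !taskDir.mkdir()) {
            throw new IllegalStateException(
                "task directory [" + taskDir + "] doesn't exist and couldn't be created");
        }
        return taskDir;
    }

    // One possible mitigation (an assumption, not the project's fix):
    // treat a failed mkdir() as an error only if the directory is still missing.
    static File tolerantDirectoryForTask(File taskDir) {
        if (!taskDir.exists() && !taskDir.mkdir() && !taskDir.isDirectory()) {
            throw new IllegalStateException(
                "task directory [" + taskDir + "] doesn't exist and couldn't be created");
        }
        return taskDir;
    }

    public static void main(String[] args) throws Exception {
        File base = Files.createTempDirectory("state-dir-sketch").toFile();
        File taskDir = new File(base, "1_0");

        // Two threads request the same task directory, as after a rebalance.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Callable<File> job = () -> tolerantDirectoryForTask(taskDir);
        List<Future<File>> results = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            results.add(pool.submit(job));
        }
        for (Future<File> result : results) {
            // Both calls succeed regardless of which thread wins the mkdir race.
            System.out.println(result.get().isDirectory());
        }
        pool.shutdown();
    }
}
```

Swapping racyDirectoryForTask into the Callable may occasionally reproduce the failure, though the window is small and it will not trigger on every run. Other options would be to synchronize the method, or to use java.nio.file.Files.createDirectories, which does not throw when the directory already exists.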

I've only confirmed that this affects 2.6 so far, but the unsafe methods are present in earlier versions. It's possible we made the problem worse somehow during "The Refactor" so that it's easier to hit this race condition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)