Posted to issues@flink.apache.org by "Zhilong Hong (Jira)" <ji...@apache.org> on 2021/01/29 08:50:00 UTC

[jira] [Created] (FLINK-21201) Creating BoundedBlockingSubpartition blocks TaskManager’s main thread

Zhilong Hong created FLINK-21201:
------------------------------------

             Summary: Creating BoundedBlockingSubpartition blocks TaskManager’s main thread
                 Key: FLINK-21201
                 URL: https://issues.apache.org/jira/browse/FLINK-21201
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.12.1
            Reporter: Zhilong Hong
         Attachments: jobmanager.log.tar.gz, taskmanager.log.tar.gz

When we run batch jobs with a parallelism of 8k, deploying the vertices takes a long time. Our investigation shows that creating BoundedBlockingSubpartitions blocks the TaskManager’s main thread during {{submitTask}}.

When the JobMaster invokes {{submitTask}}, it sends an RPC call to the TaskManager, which executes the {{submitTask}} method in its main thread. In that method, the TaskExecutor creates a Task instance and tries to start it. While creating the Task, the TaskExecutor also creates the ResultPartition and its ResultSubpartitions.
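
The effect of a long-running call on a single main thread can be illustrated with a minimal sketch. This is not Flink code: the class {{MainThreadBlockingSketch}}, the method {{heartbeatDelayMillis}} and the use of a plain single-thread executor as a stand-in for the TaskManager’s main thread are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a single-threaded RPC endpoint; not Flink's implementation.
public class MainThreadBlockingSketch {

    /**
     * Queues a blocking task and then a "heartbeat" on the same single thread,
     * and returns how long (in ms) the heartbeat waited before it ran.
     */
    static long heartbeatDelayMillis(long blockingMillis) throws Exception {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a submitTask call that performs blocking work.
            Runnable blockingCall = () -> {
                try {
                    Thread.sleep(blockingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            mainThread.submit(blockingCall);

            // The heartbeat is queued behind the blocking call on the same thread.
            long queuedAt = System.nanoTime();
            Callable<Long> heartbeat = System::nanoTime;
            Future<Long> ranAt = mainThread.submit(heartbeat);

            return TimeUnit.NANOSECONDS.toMillis(ranAt.get() - queuedAt);
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("heartbeat delayed by ~" + heartbeatDelayMillis(500) + " ms");
    }
}
```

In this sketch the heartbeat cannot run until the blocking call returns, so its delay grows with the time spent in that call.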

For batch jobs, the ResultSubpartitions are BoundedBlockingSubpartitions backed by FileChannelBoundedData. Each BoundedBlockingSubpartition creates a file on the local disk, which is an IO operation and can take a long time.

In our test, creating 8k BoundedBlockingSubpartitions took up to 28 seconds. This blocks the main thread of the TaskManager and leads to heartbeat timeouts and slow task deployment. In my opinion, the IO operation should be executed in the IOExecutor rather than in the main thread.
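
The proposal could look roughly like the following sketch, which moves file creation onto a separate executor and hands the result back as a future. It is only an illustration under stated assumptions, not a patch: {{SubpartitionFileOffloadSketch}}, {{createBackingFile}} and {{createAllAsync}} are hypothetical names, and a fixed thread pool stands in for the IOExecutor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these names exist in Flink.
public class SubpartitionFileOffloadSketch {

    /** Blocking step: creates one backing file, standing in for a subpartition's file. */
    static Path createBackingFile(Path dir, int index) {
        Path file = dir.resolve("subpartition-" + index + ".data");
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs all file creations on the given executor instead of the calling thread. */
    static CompletableFuture<List<Path>> createAllAsync(Path dir, int count, Executor ioExecutor) {
        return CompletableFuture.supplyAsync(() -> {
            List<Path> files = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                files.add(createBackingFile(dir, i));
            }
            return files;
        }, ioExecutor);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a fixed pool plays the role of the IO executor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        Path dir = Files.createTempDirectory("subpartitions");
        try {
            CompletableFuture<List<Path>> files = createAllAsync(dir, 100, ioExecutor);
            // The calling thread returns immediately and stays free for other work.
            System.out.println("created " + files.get().size() + " files in " + dir);
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

With this shape the calling thread only schedules the work; whether task startup should then wait on the future or continue asynchronously is a separate design question.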

The JobManager and TaskManager logs are attached. A typical task is Source 0: #898.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)