Posted to issues@flink.apache.org by "Zili Chen (Jira)" <ji...@apache.org> on 2019/11/20 01:32:00 UTC
[jira] [Assigned] (FLINK-13184) Starting a TaskExecutor blocks the
YarnResourceManager's main thread
[ https://issues.apache.org/jira/browse/FLINK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zili Chen reassigned FLINK-13184:
---------------------------------
Assignee: Yang Wang
> Starting a TaskExecutor blocks the YarnResourceManager's main thread
> --------------------------------------------------------------------
>
> Key: FLINK-13184
> URL: https://issues.apache.org/jira/browse/FLINK-13184
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.8.1, 1.9.0, 1.10.0
> Reporter: Xintong Song
> Assignee: Yang Wang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.10.0, 1.8.3, 1.9.2
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, YarnResourceManager starts all task executors in the main thread. This can make the RM unresponsive when launching a large number of TEs (e.g. > 1000), because starting a TE involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous {{NMClient}}). As a consequence, TE registration/heartbeat timeouts can occur, and Flink might allocate excess containers (see FLINK-12342) because it cannot process the {{YarnResourceManager#onContainersAllocated}} calls.
> There are several possible approaches, but the end goal should be to not execute any blocking calls in the {{ResourceManager}}'s main thread:
> 1. Start the TaskExecutors from a different thread (potentially a thread pool) which is responsible for uploading the files and communicating with the NodeManager.
> 2. Don't upload files (avoid blocking file system operations) and use the {{NMClientAsync}} for the communication with Yarn's {{NodeManager}}.
> 3. Upload files in a separate I/O thread and use the {{NMClientAsync}} for the communication with Yarn's {{NodeManager}}.
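
The first approach can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink's actual YarnResourceManager code: the class AsyncContainerLauncher, its launch method, and the callback parameters are all hypothetical names. The blocking work (file upload, NodeManager call) is passed in as a plain callback and runs on an I/O pool, while the result is handed back to a single-threaded executor that stands in for the ResourceManager's main thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Hypothetical sketch; names and structure are assumptions, not Flink APIs. */
final class AsyncContainerLauncher {
    // Single-threaded executor standing in for the ResourceManager's main thread.
    private final Executor mainThread;
    // Pool that is allowed to block on file uploads and NodeManager calls.
    private final Executor ioPool;

    AsyncContainerLauncher(Executor mainThread, Executor ioPool) {
        this.mainThread = mainThread;
        this.ioPool = ioPool;
    }

    /**
     * Schedules the blocking start of one container and returns immediately,
     * so the caller (the main thread) is never blocked.
     *
     * @param blockingStart hypothetical callback that uploads files and
     *                      contacts the NodeManager; runs on the I/O pool
     * @param onStarted     runs on the main thread after a successful start
     * @param onFailed      runs on the main thread if the start threw; the
     *                      throwable may be wrapped in a CompletionException
     */
    CompletableFuture<Void> launch(
            String containerId,
            Consumer<String> blockingStart,
            Consumer<String> onStarted,
            BiConsumer<String, Throwable> onFailed) {
        return CompletableFuture
                // Blocking work happens off the main thread.
                .runAsync(() -> blockingStart.accept(containerId), ioPool)
                // State changes are applied back on the main thread.
                .handleAsync((ignored, failure) -> {
                    if (failure == null) {
                        onStarted.accept(containerId);
                    } else {
                        onFailed.accept(containerId, failure);
                    }
                    return null;
                }, mainThread);
    }
}
```

Routing both the success and the failure callback through the main-thread executor keeps all bookkeeping single-threaded, which is the property the issue asks to preserve while moving the blocking calls elsewhere.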
--
This message was sent by Atlassian Jira
(v8.3.4#803005)