Posted to issues@flink.apache.org by "Xintong Song (Jira)" <ji...@apache.org> on 2020/02/10 08:00:00 UTC

[jira] [Commented] (FLINK-15959) Add TaskExecutor number option in FlinkYarnSessionCli

    [ https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033423#comment-17033423 ] 

Xintong Song commented on FLINK-15959:
--------------------------------------

Hi [~liuyufei],

Thanks for opening this ticket. The use case sounds reasonable to me, especially regarding the load balancing.

My main concern is about blocking slot requests until the minimum number of slots is registered. I'm not sure how this would affect the job / cluster startup time. It might be ok if it only affects cases where this new feature is used.
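
To make the gating behavior concrete, here is a minimal sketch (the class and method names are invented for illustration; this is not Flink's actual {{SlotManager}} API): slot requests are buffered while the number of registered slots is below the threshold, and all pending requests are completed once it is reached.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Illustrative only -- not Flink's actual SlotManager API. */
class MinSlotGate {
    private final int minRegisteredSlots; // hypothetical threshold
    private int registeredSlots;
    private final Queue<Runnable> pendingRequests = new ArrayDeque<>();

    MinSlotGate(int minRegisteredSlots) {
        this.minRegisteredSlots = minRegisteredSlots;
    }

    /** Called when a TaskExecutor registers and reports its slots. */
    synchronized void onSlotsRegistered(int slots) {
        registeredSlots += slots;
        if (registeredSlots >= minRegisteredSlots) {
            // Threshold reached: complete everything that was blocked.
            Runnable complete;
            while ((complete = pendingRequests.poll()) != null) {
                complete.run();
            }
        }
    }

    /** Called per slot request; defers completion below the threshold. */
    synchronized void onSlotRequest(Runnable completeRequest) {
        if (registeredSlots >= minRegisteredSlots) {
            completeRequest.run();
        } else {
            pendingRequests.add(completeRequest);
        }
    }
}
{code}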

Some suggestions on the proposed approach, in case we decide to solve the problem along this direction:
- I think this is not a Yarn-specific issue, but a common issue for all active deployments, including K8s and Mesos. Therefore, the changes should not be made in {{YarnResourceManager}}, but rather in common code such as {{ResourceManager}} or {{SlotManager}}.
- Instead of the total number of task executors, I would suggest exposing the configuration option to users as the minimum number of slots. The slot number is better aligned with users' knowledge of their job parallelism, and with the proposed "don't complete slot requests until the minimum number of slots is registered". By defining it as a minimum rather than a total, we can always allocate more containers than the configured value if needed. And if we make the default value of the minimum slot number 0, we keep the same behavior as before whenever the option is not explicitly configured (see the sketch below).
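
As a rough illustration of that last point (the option key, class, and description here are hypothetical; no such option exists in Flink today), such a setting could be declared with Flink's {{ConfigOptions}} builder, with a default of 0 preserving the current behavior:

{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

/** Hypothetical options holder -- for illustration only. */
public class SlotManagerMinSlotOptions {

    // The key name is made up; it is not an existing Flink option.
    public static final ConfigOption<Integer> MIN_REGISTERED_SLOTS =
            ConfigOptions.key("slotmanager.number-of-slots.min")
                    .intType()
                    .defaultValue(0) // 0 == same behavior as today
                    .withDescription(
                            "Minimum number of slots that must be registered "
                                    + "before pending slot requests are completed. "
                                    + "More containers than this may still be "
                                    + "allocated on demand.");
}
{code}

With the default of 0 the gate is effectively disabled unless a user opts in, which matches the backwards-compatibility point above.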

> Add TaskExecutor number option in FlinkYarnSessionCli
> -----------------------------------------------------
>
>                 Key: FLINK-15959
>                 URL: https://issues.apache.org/jira/browse/FLINK-15959
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> Flink removed the `-n` option after FLIP-6, changing to a model where the ResourceManager starts a new worker when required. But I think maintaining a TaskExecutor number option is necessary. These workers would start immediately when the ResourceManager starts and would not be released even if all their slots are free.
> Here are some reasons:
> # Users actually know how many resources are needed when running a single job; initializing all workers when the cluster starts can speed up the startup process.
> # Jobs are scheduled in topology order; the next operator won't be scheduled until the prior execution's slot is allocated. The TaskExecutors may therefore start in several batches in some cases, which can slow down the startup.
> # Flink supports [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122] [Spread out tasks evenly across all available registered TaskManagers], but it only takes effect once all TMs are registered. Starting all TMs at the beginning can solve this problem.
> *Suggestion:*
> I only changed YarnResourceManager: start all containers in the `initialize` stage, and don't complete slot requests until the minimum number of slots is registered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)