Posted to issues@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2020/01/02 09:42:00 UTC

[jira] [Comment Edited] (FLINK-15456) Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests

    [ https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006696#comment-17006696 ] 

Zhu Zhu edited comment on FLINK-15456 at 1/2/20 9:41 AM:
---------------------------------------------------------

[~xintongsong] Yes, it looks like the case described in FLINK-13554. 
Do you have any idea how to solve it without much risk?
I will also try to repro the issue with DEBUG logs.

cc: [~trohrmann]


was (Author: zhuzh):
This issue looks like the case described in FLINK-13554. 
[~xintongsong] do you have any idea how to solve it without much risk?

cc: [~trohrmann]

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm_part.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29 registered. The job therefore fails because it cannot acquire all 30 required slots in time.
> When failover happens and tasks are re-scheduled, the RM does not request new TMs even though it cannot fulfill the slot requests. As a result, the job keeps failing with slot allocation timeouts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)