Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/02/04 10:32:00 UTC

[jira] [Updated] (FLINK-11215) TaskExecutor RegistrationTimeoutException within the specified maximum registration duration 300000ms

     [ https://issues.apache.org/jira/browse/FLINK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-11215:
----------------------------------
    Component/s: Distributed Coordination

> TaskExecutor RegistrationTimeoutException within the specified maximum registration duration 300000ms
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11215
>                 URL: https://issues.apache.org/jira/browse/FLINK-11215
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>            Reporter: Liu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2018-12-25-14-50-35-348.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Sometimes a job fails after 5 minutes because registration at the resource manager is reported as failed.
> !https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!
> In fact, the registration succeeded 5 minutes earlier (I added the tag ljg for testing).
> !image-2018-12-25-14-50-35-348.png!
> The problem occurs because the function startRegistrationTimeout in TaskExecutor.java is called from multiple places.
> In the start function, it is called asynchronously via resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()), and it is called again at the end of the start function. The order of these two calls is not guaranteed, but both modify the same variable, currentRegistrationTimeoutId. If the asynchronous path is fast enough to call startRegistrationTimeout() first, the job fails 5 minutes later because of the call at the end of the start function.
> The solution is to call startRegistrationTimeout in the start function before resourceManagerLeaderRetriever.start(). With this change, the problem has not appeared again.
>  
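The ordering problem described above can be sketched with a small model. This is a simplified illustration, not Flink's actual TaskExecutor code: the class RegistrationTimeoutSketch and the methods wouldFail, buggyOrderFails and fixedOrderFails are hypothetical names, and it assumes that the timeout callback is fatal only when its id still equals currentRegistrationTimeoutId and that a successful registration clears that variable. Real timers and threads are replaced by explicit call order so the result is deterministic.

```java
import java.util.UUID;

/** Simplified, hypothetical model of the registration-timeout bookkeeping; not Flink's code. */
class RegistrationTimeoutSketch {
    private UUID currentRegistrationTimeoutId;

    /** Arms a new timeout and returns its id; the real code would also schedule a delayed callback. */
    UUID startRegistrationTimeout() {
        UUID id = UUID.randomUUID();
        currentRegistrationTimeoutId = id;
        return id;
    }

    /** Assumed to be called once registration at the resource manager succeeds. */
    void stopRegistrationTimeout() {
        currentRegistrationTimeoutId = null;
    }

    /** Assumed check in the delayed callback: fatal only if this timeout is still the current one. */
    boolean wouldFail(UUID firedId) {
        return firedId.equals(currentRegistrationTimeoutId);
    }

    /** Reported ordering: the async leader callback finishes before start() reaches its last line. */
    static boolean buggyOrderFails() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        te.startRegistrationTimeout();             // async path: timeout armed
        te.stopRegistrationTimeout();              // async path: registration succeeded
        UUID late = te.startRegistrationTimeout(); // end of start(): re-armed, never stopped
        return te.wouldFail(late);
    }

    /** Proposed ordering: arm the timeout before the leader retriever is started. */
    static boolean fixedOrderFails() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        UUID first = te.startRegistrationTimeout();  // start(): armed before the retriever starts
        UUID second = te.startRegistrationTimeout(); // async path: re-armed
        te.stopRegistrationTimeout();                // async path: registration succeeded
        return te.wouldFail(first) || te.wouldFail(second);
    }

    public static void main(String[] args) {
        System.out.println("buggy order fails: " + buggyOrderFails()); // prints true
        System.out.println("fixed order fails: " + fixedOrderFails()); // prints false
    }
}
```

In the reported ordering, the last call to startRegistrationTimeout leaves an id in currentRegistrationTimeoutId that nothing clears, so its callback would fire as a failure even though registration succeeded. With the proposed ordering, the successful registration is the last event to touch the variable.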



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)