Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2016/07/13 16:25:20 UTC
[jira] [Comment Edited] (FLINK-4152) TaskManager registration exponential backoff doesn't work

    [ https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375296#comment-15375296 ] 

Till Rohrmann edited comment on FLINK-4152 at 7/13/16 4:24 PM:
---------------------------------------------------------------

The restarted registration attempts are the observable symptoms caused by a different problem. 

The actual problem is that the {{YarnFlinkResourceManager}} forgets about the registered task managers if the job manager loses its leadership. Each task manager has a resource ID with which it registers at the resource manager. The {{YarnFlinkResourceManager}} has two states for allocated resources: {{containersInLaunch}} and {{registeredWorkers}}. A container can only go from {{containersInLaunch}} to {{registeredWorkers}}. This works for the initial registration. However, when the job manager loses its leadership and the {{registeredWorkers}} list is cleared, there is no longer a container in launch associated with the respective resource ID. Consequently, when the old task manager is being re-registered by the new leader, the registration is rejected.
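
To illustrate the state handling described above, here is a minimal sketch. It is not the actual {{YarnFlinkResourceManager}} code: the class name {{ResourceStateSketch}} and the method names {{containerLaunched}} and {{registerWorker}} are made up for illustration, and resource IDs are modelled as plain strings.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the two states; not actual Flink code.
class ResourceStateSketch {
    private final Set<String> containersInLaunch = new HashSet<>();
    private final Set<String> registeredWorkers = new HashSet<>();

    // A container has been started for this resource ID.
    void containerLaunched(String resourceId) {
        containersInLaunch.add(resourceId);
    }

    // The only allowed transition: containersInLaunch -> registeredWorkers.
    // Returns false (registration rejected) if the resource is not in launch.
    boolean registerWorker(String resourceId) {
        if (containersInLaunch.remove(resourceId)) {
            registeredWorkers.add(resourceId);
            return true;
        }
        return false;
    }

    // Clears the registered workers without moving them to any other state.
    void jobManagerLostLeadership() {
        registeredWorkers.clear();
    }
}
```

In this model, a task manager that registered successfully and then tries to register again after {{jobManagerLostLeadership}} is rejected, because its resource ID is in neither collection.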

This rejection is then sent to the task manager. Upon receiving a rejection, the task manager schedules another registration attempt after waiting for some time. The problem here is that the old registration attempts are not cancelled. Consequently, multiple registration attempts run concurrently. That is why you observe so many registration attempt messages in the log.

I think the symptom can be fixed by cancelling all currently active registration attempts before the registration is restarted.
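
A rough sketch of that idea, assuming a plain {{java.util.concurrent}} scheduler instead of the task manager's real scheduling mechanism; the class {{RegistrationRetrySketch}} and the method {{restartRegistration}} are hypothetical names:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: at most one registration attempt is pending at any time.
class RegistrationRetrySketch {
    // Daemon thread so that a forgotten scheduler does not keep the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "registration-retry");
                t.setDaemon(true);
                return t;
            });

    private ScheduledFuture<?> pendingAttempt;

    // Cancels the previously scheduled attempt (if any) before scheduling a new one.
    synchronized ScheduledFuture<?> restartRegistration(Runnable attempt, long delayMillis) {
        if (pendingAttempt != null) {
            pendingAttempt.cancel(false);
        }
        pendingAttempt = scheduler.schedule(attempt, delayMillis, TimeUnit.MILLISECONDS);
        return pendingAttempt;
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With this, every restart replaces the pending attempt instead of adding one more to the attempts that are already running.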

It is a bit unclear to me what the expected behaviour of the {{YarnFlinkResourceManager}} should be. In the {{jobManagerLostLeadership}} method, where the {{registeredWorkers}} list is cleared, a comment says "all currently registered TaskManagers are put under 'awaiting registration'". But there is no such state. Furthermore, I'm not sure whether registered TaskManagers have to re-register at all if only the job manager has failed.

Thus, I see two solutions: either do not clear {{registeredWorkers}}, or introduce a new state "awaiting registration" that keeps all formerly registered task managers so that they can be re-registered.
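
The second option could look roughly like the following sketch. Again, this is only an illustration under assumptions, not a proposed patch: the class name, the method names {{containerLaunched}} and {{registerWorker}}, and the {{awaitingRegistration}} collection are hypothetical, and resource IDs are plain strings.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an additional "awaiting registration" state; not actual Flink code.
class AwaitingRegistrationSketch {
    private final Set<String> containersInLaunch = new HashSet<>();
    private final Set<String> awaitingRegistration = new HashSet<>();
    private final Set<String> registeredWorkers = new HashSet<>();

    void containerLaunched(String resourceId) {
        containersInLaunch.add(resourceId);
    }

    // Accepts resources that are either freshly launched or awaiting re-registration.
    boolean registerWorker(String resourceId) {
        if (containersInLaunch.remove(resourceId) || awaitingRegistration.remove(resourceId)) {
            registeredWorkers.add(resourceId);
            return true;
        }
        return false;
    }

    // Formerly registered workers are remembered instead of being forgotten.
    void jobManagerLostLeadership() {
        awaitingRegistration.addAll(registeredWorkers);
        registeredWorkers.clear();
    }
}
```

Here a formerly registered task manager is accepted again after a leadership change, while a completely unknown resource ID is still rejected.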

Maybe [~mxm] can give some input.


> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>
>                 Key: FLINK-4152
>                 URL: https://issues.apache.org/jira/browse/FLINK-4152
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, TaskManager, YARN Client
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>         Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many messages when registering at the JobManager.
> This is the log file: https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that this is the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)