Posted to issues@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/04/25 13:23:00 UTC

[jira] [Comment Edited] (FLINK-27354) JobMaster still processes requests while terminating

    [ https://issues.apache.org/jira/browse/FLINK-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527486#comment-17527486 ] 

Matthias Pohl edited comment on FLINK-27354 at 4/25/22 1:22 PM:
----------------------------------------------------------------

The retry mechanism is scheduled using the {{rpcService}} of the {{JobMaster}} (see [JobMaster:1291|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1291]). The behavior is as described in the issue description: The {{JobMaster}} is deregistered in the {{ResourceManager}}. The RM informs the {{JobMaster}} about the disconnect. The {{JobMaster}} then tries to reconnect to the {{ResourceManager}}. The {{StandaloneResourceManager}} is able to process the RPC calls and, after some time, returns a "{{RpcConnectionException: Could not connect to rpc endpoint under address}}" error, which results in the repeated "{{Registering job manager [...] failed}}" log messages.

Internally, a {{RetryingRegistration}} is used in the {{ResourceManagerConnection}} (see [JobMaster:1285|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1285]). The initial invocation is triggered with a rather small timeout of 100ms (derived from [cluster.registration.initial-timeout|https://github.com/apache/flink/blob/e921c4c34b5497f4ba723ddae58750f6778069fa/flink-core/src/main/java/org/apache/flink/configuration/ClusterOptions.java#L41]). This attempt fails and we end up in the error handling of the registration (see [RetryingRegistration:281|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L281]). The timeout grows exponentially from one attempt to the next (see [RetryingRegistration:297|https://github.com/apache/flink/blob/582941b0f13d1cc51077e0e69fd100afe080779f/flink-runtime/src/main/java/org/apache/flink/runtime/registration/RetryingRegistration.java#L297]) because that is how {{RetryingRegistration}} handles timed-out attempts. This can be observed in the logs as well and explains the multiple log messages.
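
To illustrate why the log messages appear at growing intervals, here is a minimal sketch of a doubling timeout schedule. It is not the actual {{RetryingRegistration}} code: the class and method names are made up for this example, and the 30s cap is an assumed value standing in for the configured maximum registration timeout. Only the 100ms starting value comes from the configuration option mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and the cap value are assumptions, not Flink's code.
final class RegistrationBackoffSketch {

    // Doubles the timeout after a timed-out attempt, never exceeding maxMillis.
    static long nextTimeoutMillis(long currentMillis, long maxMillis) {
        return Math.min(2 * currentMillis, maxMillis);
    }

    // Returns the timeout used for each of the first `attempts` registration attempts.
    static List<Long> timeoutSchedule(long initialMillis, long maxMillis, int attempts) {
        List<Long> schedule = new ArrayList<>();
        long timeout = initialMillis;
        for (int i = 0; i < attempts; i++) {
            schedule.add(timeout);
            timeout = nextTimeoutMillis(timeout, maxMillis);
        }
        return schedule;
    }

    public static void main(String[] args) {
        // 100ms initial timeout as in the comment above; 30s cap is an assumed value.
        System.out.println(timeoutSchedule(100L, 30_000L, 10));
        // prints [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000]
    }
}
```

With these assumed values the timeout reaches the cap at the tenth attempt, so a terminating {{JobMaster}} that keeps retrying would log a failure roughly every 30s from then on.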



> JobMaster still processes requests while terminating
> ----------------------------------------------------
>
>                 Key: FLINK-27354
>                 URL: https://issues.apache.org/jira/browse/FLINK-27354
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.6, 1.14.4
>            Reporter: Matthias Pohl
>            Priority: Major
>         Attachments: flink-logs.zip
>
>
> An issue was reported in the [user ML|https://lists.apache.org/thread/5pm3crntmb1hl17h4txnlhjz34clghrg] about the JobMaster trying to reconnect to the ResourceManager during shutdown.
> The JobMaster disconnects from the ResourceManager during shutdown (see [JobMaster:1182|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1182]). This triggers the deregistration of the job in the {{ResourceManager}}. The RM responds asynchronously at the end of this deregistration through {{disconnectResourceManager}} (see [ResourceManager:993|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L993]), which will trigger a reconnect on the {{JobMaster}}'s side (see [JobMaster::disconnectResourceManager|https://github.com/apache/flink/blob/da532423487e0534b5fe61f5a02366833f76193a/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L789]) if it's still around, because the {{resourceManagerAddress}} (used in {{isConnectingToResourceManager}}) is not cleared. Clearing it would only happen during an RM leader change.
> The {{disconnectResourceManager}} will be ignored if the {{JobMaster}} is gone already.
> We should add a guard in some way to {{JobMaster}} to avoid reconnecting to other components during shutdown. This might not only include the ResourceManager connection but might also affect other parts of the {{JobMaster}} API.
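
One possible shape of such a guard, as a rough sketch only: the class, the flag and the method bodies below are hypothetical and are not taken from {{JobMaster}}; only the method name {{disconnectResourceManager}} mirrors the RPC mentioned in the description. The idea is to set a flag when shutdown starts and to ignore disconnect notifications afterwards instead of scheduling a reconnect.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a shutdown guard; not the actual JobMaster implementation.
final class ShutdownGuardSketch {

    // Set once shutdown has started; checked before any reconnect is attempted.
    private final AtomicBoolean terminating = new AtomicBoolean(false);

    // Counts reconnects that were actually triggered (for illustration only).
    private int reconnectAttempts = 0;

    // Stand-in for the start of the shutdown sequence.
    void onStop() {
        terminating.set(true);
    }

    // Stand-in for the disconnect notification sent by the ResourceManager.
    // Returns true if a reconnect was triggered, false if it was suppressed.
    boolean disconnectResourceManager() {
        if (terminating.get()) {
            // Shutting down: do not reconnect to other components.
            return false;
        }
        reconnectAttempts++;
        return true;
    }

    int getReconnectAttempts() {
        return reconnectAttempts;
    }
}
```

In this sketch a disconnect that arrives before {{onStop()}} still triggers a reconnect, while one that arrives afterwards is dropped. Whether a single flag is enough for the other parts of the {{JobMaster}} API is left open here.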



--
This message was sent by Atlassian Jira
(v8.20.7#820007)