Posted to issues@flink.apache.org by "Huang Xingbo (Jira)" <ji...@apache.org> on 2022/06/09 06:46:00 UTC
[jira] [Updated] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again
[ https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Huang Xingbo updated FLINK-27420:
---------------------------------
Fix Version/s: 1.14.6
(was: 1.14.5)
> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.13.5
> Reporter: Ben Augarten
> Assignee: Ben Augarten
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.16.0, 1.15.1, 1.14.6
>
>
> The symptom is that the SlotManager metrics taskSlotsAvailable and taskSlotsTotal are missing after a SlotManager is suspended and then restarted. We noticed this issue when running 1.13.5, but I believe it also affects 1.14.x, 1.15.x, and master.
>
> When a SlotManager is suspended, the [metrics group is closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214]. When the SlotManager is [started again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181], it makes an attempt to [reregister metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202], but that fails because the underlying metrics group [is still closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
>
> I was able to trace through this issue by restarting zookeeper nodes in a staging environment and watching the JM with a debugger.
>
> A concise test, which currently fails, shows the expected behavior: [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>
> I am happy to provide a PR to fix this issue, but first would like to verify that this is not intended.
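
The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Flink's classes: ToyMetricGroup and ToySlotManager are hypothetical stand-ins, and the sketch assumes that a closed group silently ignores new registrations and never reopens, which is what the linked check in AbstractMetricGroup appears to do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class SlotManagerMetricsSketch {

    /** Hypothetical stand-in for a metric group; NOT Flink's AbstractMetricGroup. */
    static final class ToyMetricGroup {
        private final Map<String, Supplier<Integer>> gauges = new HashMap<>();
        private boolean closed;

        /** Registers a gauge, or silently drops the request if the group is closed. */
        void gauge(String name, Supplier<Integer> value) {
            if (!closed) {
                gauges.put(name, value);
            }
        }

        /** Closing is permanent: existing gauges are dropped and the flag never resets. */
        void close() {
            closed = true;
            gauges.clear();
        }

        boolean has(String name) {
            return gauges.containsKey(name);
        }
    }

    /** Hypothetical stand-in for a slot manager that reuses one group across restarts. */
    static final class ToySlotManager {
        private final ToyMetricGroup metrics;

        ToySlotManager(ToyMetricGroup metrics) {
            this.metrics = metrics;
        }

        /** Registers the two slot gauges; the values are placeholders. */
        void start() {
            metrics.gauge("taskSlotsAvailable", () -> 0);
            metrics.gauge("taskSlotsTotal", () -> 0);
        }

        /** Closes the shared group, mirroring the suspend step described above. */
        void suspend() {
            metrics.close();
        }
    }

    public static void main(String[] args) {
        ToyMetricGroup group = new ToyMetricGroup();
        ToySlotManager slotManager = new ToySlotManager(group);

        slotManager.start();
        System.out.println("registered after first start: " + group.has("taskSlotsTotal"));

        slotManager.suspend();
        slotManager.start(); // re-registration is ignored because the group is still closed
        System.out.println("registered after restart: " + group.has("taskSlotsTotal"));
    }
}
```

Running the sketch prints true after the first start and false after the restart. In this toy model, either handing the manager a fresh group on each start or not closing the group on suspend would avoid the problem; whether either matches the fix that was actually merged is not something this sketch claims.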