Posted to issues@flink.apache.org by "Robert Metzger (Jira)" <ji...@apache.org> on 2020/07/24 10:29:00 UTC
[jira] [Commented] (FLINK-11127) Make metrics query service establish connection to JobManager

    [ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164351#comment-17164351 ] 

Robert Metzger commented on FLINK-11127:
----------------------------------------

After offline discussions with [~trohrmann] and [~uce], I believe we can close this issue. Since FLINK-11632 sets "taskmanager.network.bind-policy" to "ip" by default, establishing the connection between the JM and the TMs should always work on Kubernetes.
A good argument for why this works is how the network stack (specifically Netty) establishes connections between the TaskManagers: they also connect to each other via the TM IP. So IP-based connections between any pods in K8s should work; if they did not, we would have much bigger problems.
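
To illustrate the difference between the two bind policies, here is a minimal sketch. It is not Flink's actual implementation: the class BindPolicySketch, its helper methods, and the pod IP 10.0.0.7 are hypothetical and only show which address a TM would advertise under "name" versus "ip". The pod name is taken from the warning quoted in the issue description.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration only; this is NOT Flink code.
class BindPolicySketch {

    // Builds an InetAddress from a name and raw IPv4 bytes without any DNS lookup.
    static InetAddress podAddress(String podName, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    podName, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // Only thrown for an address of illegal length, which cannot happen here.
            throw new IllegalStateException(e);
        }
    }

    // Returns the address a TM would advertise under the given policy (assumed semantics).
    static String advertisedAddress(InetAddress addr, String policy) {
        switch (policy) {
            case "ip":
                // Pod IPs are routable between pods inside the cluster.
                return addr.getHostAddress();
            case "name":
                // Pod names are typically not resolvable without a per-pod service.
                return addr.getHostName();
            default:
                throw new IllegalArgumentException("Unknown bind policy: " + policy);
        }
    }

    public static void main(String[] args) {
        // 10.0.0.7 is a made-up pod IP for illustration.
        InetAddress pod = podAddress("flink-task-manager-64b868487c-x9l4b", 10, 0, 0, 7);
        System.out.println("name -> " + advertisedAddress(pod, "name"));
        System.out.println("ip   -> " + advertisedAddress(pod, "ip"));
    }
}
```

With "name", the JM would have to resolve the pod name, which is what fails in the quoted warning. With "ip", it connects to the pod IP directly, the same way the TMs reach each other.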

I don't fully get what [~aroch] means by "accessible from the outside":
{quote}Also, the "ip" bind-policy would not help because the resolved IP is the internal network IP which is not accessible from outside and JM fails to fetch metrics.{quote}
... if you mean "outside" as in outside the K8s cluster (say, through a load balancer into the cluster), then I agree, the IP won't be accessible. But that's also not what we need here. Internal access is sufficient.
Unless [~aroch] can describe a scenario where the JM cannot connect to the TMs, I would close this ticket.



> Make metrics query service establish connection to JobManager
> -------------------------------------------------------------
>
>                 Key: FLINK-11127
>                 URL: https://issues.apache.org/jira/browse/FLINK-11127
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes, Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.7.0, 1.9.2, 1.10.0
>            Reporter: Ufuk Celebi
>            Assignee: Robert Metzger
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As part of FLINK-10247, the internal metrics query service has been separated into its own actor system. Before this change, the JobManager (JM) queried TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a separate connection to the TM metrics query service actor.
> In the context of Kubernetes, this is problematic as the JM will typically *not* be able to resolve the TMs by name, resulting in warnings as follows:
> {code}
> 2018-12-11 08:32:33,962 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve]
> {code}
> In order to expose the TMs by name in Kubernetes, users require a service *for each* TM instance, which is not practical.
> This currently results in the web UI not being able to display some basic metrics about the number of sent records. You can reproduce this by following the READMEs in {{flink-container/kubernetes}}.
> This worked before, because the JM is typically exposed via a service with a known name and the TMs establish the connection to it which the metrics query service piggybacked on.
> A potential solution to this might be to let the query service connect to the JM similar to how the TMs register.
> I tagged this ticket as an improvement, but in the context of Kubernetes I would consider this to be a bug.


