Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2020/08/28 09:56:00 UTC
[jira] [Closed] (FLINK-14328) JobCluster cannot reach TaskManager in K8s

     [ https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler closed FLINK-14328.
------------------------------------
    Fix Version/s:     (was: 1.9.4)
       Resolution: Cannot Reproduce

> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
>                 Key: FLINK-14328
>                 URL: https://issues.apache.org/jira/browse/FLINK-14328
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>            Reporter: Tim
>            Priority: Major
>
> I have a Job Cluster which I am running in K8s.  It consists of
>  * job manager deployment (1)
>  * task manager deployment (1)
>  * service
> This more or less follows the standard "Job Cluster" setup. Additionally, because of known issues with TMs talking to JMs, I have set taskmanager.network.bind-policy to "ip", so that the task manager binds to the pod's IP address rather than the pod name (which cannot be resolved via DNS). So far so good.
>  
> Once the cluster is started, I can see the job running. I also see that the JM's resource manager has registered the TM.
> {code:java}
> 2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Slot Pool Status:
>         status: connected to akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
>         registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
>         available slots: []
>         allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ f34656491b8dfae726d992d276dc6d39 @ data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) - 0]]
>         pending requests: []
>         }
> {code}
> However, I see several errors like the one below before the job eventually fails (after maybe 5 minutes) and goes into recovery. This repeats until all restarts are exhausted, at which point the cluster fails completely.
> {code:java}
> 2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-6 - Association with remote system [akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.107.38.92:50100]] Caused by: [java.net.ConnectException: Connection refused: /10.107.38.92:50100]
> {code}
> To me it looks like the JM is not able to make a connection on the RPC port of the task manager (50100 is the taskmanager.rpc.port setting, and 10.107.38.92 is the IP address of the task manager pod as shown by "kubectl describe pod").
> Has anyone come across this issue?
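One way to narrow down a "Connection refused" like the one in the warning above is to take the address straight from the log line and attempt a plain TCP connection to it from inside the JM pod. The following is only a minimal sketch: the helper names `parse_akka_address` and `can_connect` are hypothetical and not part of Flink, and it assumes a Python interpreter is available in the container.

```python
import re
import socket

# Matches addresses such as akka.tcp://flink@10.107.38.92:50100 (format taken from the log above).
AKKA_ADDRESS = re.compile(r"akka\.tcp://flink@([^:/\]\s]+):(\d+)")


def parse_akka_address(log_line):
    """Return (host, port) for the first akka.tcp://flink@host:port in log_line, or None."""
    match = AKKA_ADDRESS.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `can_connect(*parse_akka_address(line))` on the warning line reports whether 10.107.38.92:50100 accepts connections from wherever the script runs. A refusal only shows that nothing is accepting connections on that address and port from that vantage point; it does not by itself say whether the cause is the bind address, the port setting, or the network.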



--
This message was sent by Atlassian Jira
(v8.3.4#803005)