Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2020/08/28 09:56:00 UTC
[jira] [Closed] (FLINK-14328) JobCluster cannot reach TaskManager in K8s
[ https://issues.apache.org/jira/browse/FLINK-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chesnay Schepler closed FLINK-14328.
------------------------------------
Fix Version/s: (was: 1.9.4)
Resolution: Cannot Reproduce
> JobCluster cannot reach TaskManager in K8s
> ------------------------------------------
>
> Key: FLINK-14328
> URL: https://issues.apache.org/jira/browse/FLINK-14328
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Reporter: Tim
> Priority: Major
>
> I have a Job Cluster which I am running in K8s. It consists of
> * job manager deployment (1)
> * task manager deployment (1)
> * service
> This is more or less the standard "Job Cluster" setup. Additionally, due to known issues with TMs talking to JMs, I have set taskmanager.network.bind-policy to "ip", so that the task manager binds to the IP of the pod rather than the pod name (which is not reachable via DNS). So far so good.
>
> Once the cluster is started, I can see the job running. I also see that the JM's resource manager has registered the TM.
> {code:java}
> 2019-10-05 20:37:14.554 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Slot Pool Status:
> status: connected to akka.tcp://flink@data-capture-enrichedtrans-raw-jobcluster:6123/user/resourcemanager
> registered TaskManagers: [f34656491b8dfae726d992d276dc6d39]
> available slots: []
> allocated slots: [[AllocatedSlot a00f44d19f38ca36da3ae5083c2d02ae @ f34656491b8dfae726d992d276dc6d39 @ data-capture-enrichedtrans-raw-taskmanager-674476f57c-26kxr (dataPort=35815) - 0]]
> pending requests: []
> }
> {code}
> However, I see several errors like the one below before the job eventually fails (after maybe 5 minutes) and goes into recovery. This repeats until all restarts are exhausted, at which point the cluster fails completely.
> {code:java}
> 2019-10-05 20:42:14.768 [flink-akka.actor.default-dispatcher-19] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-6 - Association with remote system [akka.tcp://flink@10.107.38.92:50100] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.107.38.92:50100]] Caused by: [java.net.ConnectException: Connection refused: /10.107.38.92:50100]
> {code}
> To me it looks like the JM is not able to make a connection on the RPC port of the taskmanager (50100 is the taskmanager.rpc.port setting, and 10.107.38.92 is the IP address of the task manager pod as seen by "kubectl describe pod").
> Has anyone come across this issue?
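One way to narrow down a "Connection refused" like the one quoted above is to probe the same address from inside the job manager pod. The following is only a minimal sketch, not part of Flink: the helper names parse_akka_address and can_connect are hypothetical, and it assumes a Python interpreter is available in the pod. It extracts the host and port from an akka.tcp address as it appears in the log line and tries to open a plain TCP connection to it.

```python
import re
import socket

# Matches addresses such as akka.tcp://flink@10.107.38.92:50100
# (hypothetical helper pattern, written for the log lines quoted in this issue).
_AKKA_ADDRESS = re.compile(r"akka\.tcp://[^@\s]+@([^:/\s\]]+):(\d+)")


def parse_akka_address(text):
    """Return (host, port) for the first akka.tcp address in text, or None."""
    match = _AKKA_ADDRESS.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Address taken from the WARN line in the issue description.
    line = "Association with remote system [akka.tcp://flink@10.107.38.92:50100] has failed"
    host, port = parse_akka_address(line)
    state = "reachable" if can_connect(host, port) else "unreachable"
    print(f"{host}:{port} is {state}")
```

If the probe fails from the job manager pod but succeeds from inside the task manager pod itself, that would suggest the port is open but not reachable across pods. If it fails in both places, nothing may be listening on that address at all. Either result is only a hint, not a diagnosis.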
--
This message was sent by Atlassian Jira
(v8.3.4#803005)