Posted to dev@flink.apache.org by "geonyeong kim (Jira)" <ji...@apache.org> on 2022/08/26 08:42:00 UTC

[jira] [Created] (FLINK-29117) Tried to associate with unreachable remote address

geonyeong kim created FLINK-29117:
-------------------------------------

Summary: Tried to associate with unreachable remote address
Key: FLINK-29117
URL: https://issues.apache.org/jira/browse/FLINK-29117
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes, flink-contrib, flink-docker, Kubernetes Operator
Affects Versions: kubernetes-operator-1.1.0, 1.15.1
Reporter: geonyeong kim
Attachments: Screen Shot 2022-08-26 at 5.04.37 PM.png

Hello.

I am planning to deploy and use FlinkDeployment resources through the Flink Kubernetes Operator.

The CRD, operator, webhook, etc. are all set up, and we deployed a FlinkDeployment to confirm that it operates normally.

*However, strangely, the task managers fail to connect to the resource manager when there is more than one task manager pod replica.*

I thought it might be a problem with Akka, timeouts, etc., so I increased the values below, but the connection continues to fail.

- akka.retry-gate-closed-for: 10000
- akka.server-socket-worker-pool.pool-size-min: 6
- akka.server-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-max: 10
- akka.client-socket-worker-pool.pool-size-min: 6
- blob.client.connect.

The log of the taskmanager is as follows.

{code:java}
Association with remote system [akka.tcp://flink@10.238.80.92:6123] has failed, address is now gated for [10000] ms. Reason: [Disassociated]
Could not resolve ResourceManager address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.238.80.92:6123/user/rpc/resourcemanager_1.
Tried to associate with unreachable remote address [akka.tcp://flink@10.238.80.92:6123]. Address is now gated for 10000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
{code}

*A TCP check run from inside the task manager pod shows that the connection is open.*
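
For reference, a TCP check of this kind can be sketched as below. This is a minimal illustration, not necessarily the exact check that was run: the helper name `tcp_port_open` and the 3-second timeout are assumptions, and the address and port in the usage comment are the ones from the log above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only; it tests plain TCP
    reachability and knows nothing about Akka or Flink RPC.
    """
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage from inside the task manager pod (address from the log):
#   tcp_port_open("10.238.80.92", 6123)
```

A successful result only shows that the port accepts TCP connections. The quarantine reported in the log is state kept by Akka remoting on top of that connection, so the port can be open while the association is still rejected.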

*Below are the flink versions I used.*

*- flink image: 1.15.1*
*- flink kubernetes operator: 1.1.0*

*I would appreciate it if you could look into the problem quickly.*
*If it is a bug, please tell me how to work around it in the meantime.*
