You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by "Elisha, Moshe (Nokia - IL/Kfar Sava)" <mo...@nokia.com> on 2022/06/16 07:24:24 UTC
Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Hello,
We are launching Flink deployments using the Flink Kubernetes Operator<https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/> on a Kubernetes cluster with Istio and mTLS enabled.
We found that the TaskManager is unable to communicate with the JobManager on the jobmanager-rpc port:
2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
The reason for the issue is that the JobManager service port definitions are not following the Istio guidelines https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/ (see example below).
We believe a change to the default port definitions is needed but for now, is there an immediate action we can take to work around the issue? Perhaps overriding the default port definitions somehow?
Thanks.
flink-kubernetes-operator 1.0.0
Flink 1.14-java11
Kubernetes v1.19.5
Istio 1.7.6
# k get service inference-results-to-analytics-engine -o yaml
apiVersion: v1
kind: Service
metadata:
...
labels:
app: inference-results-to-analytics-engine
type: flink-native-kubernetes
name: inference-results-to-analytics-engine
spec:
clusterIP: None
ports:
- name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol" property
port: 6123
protocol: TCP
targetPort: 6123
- name: blobserver # should start with "tcp-" or add "appProtocol" property
port: 6124
protocol: TCP
targetPort: 6124
selector:
app: inference-results-to-analytics-engine
component: jobmanager
type: flink-native-kubernetes
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by "Elisha, Moshe (Nokia - IL/Kfar Sava)" <mo...@nokia.com>.
Hi,
Thank you all for your replies. The suggestion “to allow Akka cluster communication to bypass the Istio sidecar proxy” helped and we were able to deploy.
I understand and agree with the rational to “follow a standard as Flink and avoid implementing guidelines from different vendors/providers”.
That said, it was very hard to work around the issue and forced us to skip Istio proxy which is not ideal.
We would like you to consider changing the default port definitions or allow to override.
We have created a Jira improvement to continue the discussion in a formal way - https://issues.apache.org/jira/browse/FLINK-28171
Thanks a lot.
From: Martijn Visser <ma...@apache.org>
Date: Monday, 20 June 2022 at 15:13
To: Nathan Fisher <nf...@junctionbox.ca>
Cc: Őrhidi Mátyás <ma...@gmail.com>, Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>, Sigalit Eliazov <e....@gmail.com>, Yang Wang <da...@gmail.com>, user@flink.apache.org <us...@flink.apache.org>
Subject: Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
The Istio guideline implies that this is a guidance, not a standard. Is that correct? Is there a standard (already)? I think we should follow a standard as Flink and avoid implementing guidelines from different vendors/providers.
Op ma 20 jun. 2022 om 13:36 schreef Nathan Fisher <nf...@junctionbox.ca>>:
Would it make sense to add the annotations to the task manager and job manager? In a non-istio environment it’d be a noop.
mTLS as a requirement is more complicated but having some docs around using cert-manager might be enough depending on the orgs requirement.
On Mon, Jun 20, 2022 at 06:18, Őrhidi Mátyás <ma...@gmail.com>> wrote:
It seems Istio must be configured to allow Akka cluster communication to bypass the Istio sidecar proxy:
https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html
On Mon, Jun 20, 2022 at 11:30 AM Sigalit Eliazov <e....@gmail.com>> wrote:
Hi,
we have enabled HA as suggested, the task manager tries to reach the job manager via pod id as expected but
the task manager is unable to connect to the job manager:
2022-06-19 22:14:45,101 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)<http://flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)>.
2022-06-19 22:14:45,242 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [/192.168.3.144:6123<http://192.168.3.144:6123>] failed with java.io.IOException: Connection reset by peer
2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.3.144:6123<http://flink@192.168.3.144:6123>] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.3.144:6123<http://flink@192.168.3.144:6123>]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2022-06-19 22:14:45,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0<http://flink@192.168.3.144:6123/user/rpc/resourcemanager_0>, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@192.168.3.144:6123/user/rpc/resourcemanager_0<http://flink@192.168.3.144:6123/user/rpc/resourcemanager_0>.
2022-06-
Are there any additional definitions required for that?
thanks
Sigalit
On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <da...@gmail.com>> wrote:
Could you please have a try with high availability enabled[1]?
If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip.
[1]. https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
Best,
Yang
Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>> 于2022年6月16日周四 15:24写道:
Hello,
We are launching Flink deployments using the Flink Kubernetes Operator<https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/> on a Kubernetes cluster with Istio and mTLS enabled.
We found that the TaskManager is unable to communicate with the JobManager on the jobmanager-rpc port:
2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
The reason for the issue is that the JobManager service port definitions are not following the Istio guidelines https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/ (see example below).
We believe a change to the default port definitions is needed but for now, is there an immediate action we can take to work around the issue? Perhaps overriding the default port definitions somehow?
Thanks.
flink-kubernetes-operator 1.0.0
Flink 1.14-java11
Kubernetes v1.19.5
Istio 1.7.6
# k get service inference-results-to-analytics-engine -o yaml
apiVersion: v1
kind: Service
metadata:
...
labels:
app: inference-results-to-analytics-engine
type: flink-native-kubernetes
name: inference-results-to-analytics-engine
spec:
clusterIP: None
ports:
- name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol" property
port: 6123
protocol: TCP
targetPort: 6123
- name: blobserver # should start with "tcp-" or add "appProtocol" property
port: 6124
protocol: TCP
targetPort: 6124
selector:
app: inference-results-to-analytics-engine
component: jobmanager
type: flink-native-kubernetes
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by Martijn Visser <ma...@apache.org>.
The Istio guideline implies that this is a guidance, not a standard. Is
that correct? Is there a standard (already)? I think we should follow a
standard as Flink and avoid implementing guidelines from different
vendors/providers.
Op ma 20 jun. 2022 om 13:36 schreef Nathan Fisher <nf...@junctionbox.ca>:
> Would it make sense to add the annotations to the task manager and job
> manager? In a non-istio environment it’d be a noop.
>
> mTLS as a requirement is more complicated but having some docs around
> using cert-manager might be enough depending on the orgs requirement.
>
> On Mon, Jun 20, 2022 at 06:18, Őrhidi Mátyás <ma...@gmail.com>
> wrote:
>
>> It seems Istio must be configured to allow Akka cluster communication to
>> bypass the Istio sidecar proxy:
>> https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html
>>
>> On Mon, Jun 20, 2022 at 11:30 AM Sigalit Eliazov <e....@gmail.com>
>> wrote:
>>
>>> Hi,
>>> we have enabled HA as suggested, the task manager tries to reach the job
>>> manager via pod id as expected but
>>> the task manager is unable to connect to the job manager:
>>>
>>>
>>> 2022-06-19 22:14:45,101 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://
>>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)
>>> .
>>>
>>>
>>> 2022-06-19 22:14:45,242 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [/
>>> 192.168.3.144:6123
>>> ] failed with java.io.IOException: Connection reset by peer
>>>
>>>
>>> 2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://
>>> flink@192.168.3.144:6123
>>> ] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://
>>> flink@192.168.3.144:6123
>>> ]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>
>>>
>>> 2022-06-19 22:14:45,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://
>>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0
>>> , retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://
>>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0.
>>>
>>> 2022-06-
>>>
>>>
>>> Are there any additional definitions required for that?
>>>
>>>
>>> thanks
>>>
>>> Sigalit
>>>
>>> On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <da...@gmail.com> wrote:
>>>
>>>> Could you please have a try with high availability enabled[1]?
>>>>
>>>> If HA enabled, the internal jobmanager rpc service will not be created.
>>>> Instead, the TaskManager retrieves the JobManager address via HA services
>>>> and connects to it via pod ip.
>>>>
>>>> [1].
>>>> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
>>>>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>
>>>> 于2022年6月16日周四 15:24写道:
>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>>
>>>>> We are launching Flink deployments using the Flink Kubernetes Operator
>>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>>>>> on a Kubernetes cluster with Istio and mTLS enabled.
>>>>>
>>>>>
>>>>>
>>>>> We found that the TaskManager is unable to communicate with the
>>>>> JobManager on the jobmanager-rpc port:
>>>>>
>>>>>
>>>>>
>>>>> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor
>>>>> [] - Association with remote system
>>>>> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
>>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>>> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
>>>>> Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>>>
>>>>>
>>>>>
>>>>> The reason for the issue is that the JobManager service port
>>>>> definitions are not following the Istio guidelines
>>>>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>>>>> (see example below).
>>>>>
>>>>>
>>>>>
>>>>> We believe a change to the default port definitions is needed but for
>>>>> now, is there an immediate action we can take to work around the issue?
>>>>> Perhaps overriding the default port definitions somehow?
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> flink-kubernetes-operator 1.0.0
>>>>>
>>>>> Flink 1.14-java11
>>>>>
>>>>> Kubernetes v1.19.5
>>>>>
>>>>> Istio 1.7.6
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> # k get service inference-results-to-analytics-engine -o yaml
>>>>>
>>>>> apiVersion: v1
>>>>>
>>>>> kind: Service
>>>>>
>>>>> metadata:
>>>>>
>>>>> ...
>>>>>
>>>>> labels:
>>>>>
>>>>> app: inference-results-to-analytics-engine
>>>>>
>>>>> type: flink-native-kubernetes
>>>>>
>>>>> name: inference-results-to-analytics-engine
>>>>>
>>>>> spec:
>>>>>
>>>>> clusterIP: None
>>>>>
>>>>> ports:
>>>>>
>>>>> - name: jobmanager-rpc # should start with “tcp-“ or add "
>>>>> appProtocol" property
>>>>>
>>>>> port: 6123
>>>>>
>>>>> protocol: TCP
>>>>>
>>>>> targetPort: 6123
>>>>>
>>>>> - name: blobserver # should start with "tcp-" or add "appProtocol"
>>>>> property
>>>>>
>>>>> port: 6124
>>>>>
>>>>> protocol: TCP
>>>>>
>>>>> targetPort: 6124
>>>>>
>>>>> selector:
>>>>>
>>>>> app: inference-results-to-analytics-engine
>>>>>
>>>>> component: jobmanager
>>>>>
>>>>> type: flink-native-kubernetes
>>>>>
>>>>> sessionAffinity: None
>>>>>
>>>>> type: ClusterIP
>>>>>
>>>>> status:
>>>>>
>>>>> loadBalancer: {}
>>>>>
>>>>>
>>>>>
>>>>
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by Nathan Fisher <nf...@junctionbox.ca>.
Would it make sense to add the annotations to the task manager and job
manager? In a non-istio environment it’d be a noop.
mTLS as a requirement is more complicated but having some docs around using
cert-manager might be enough depending on the orgs requirement.
On Mon, Jun 20, 2022 at 06:18, Őrhidi Mátyás <ma...@gmail.com>
wrote:
> It seems Istio must be configured to allow Akka cluster communication to
> bypass the Istio sidecar proxy:
> https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html
>
> On Mon, Jun 20, 2022 at 11:30 AM Sigalit Eliazov <e....@gmail.com>
> wrote:
>
>> Hi,
>> we have enabled HA as suggested, the task manager tries to reach the job
>> manager via pod id as expected but
>> the task manager is unable to connect to the job manager:
>>
>>
>> 2022-06-19 22:14:45,101 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://
>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)
>> .
>>
>>
>> 2022-06-19 22:14:45,242 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [/
>> 192.168.3.144:6123
>> ] failed with java.io.IOException: Connection reset by peer
>>
>>
>> 2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://
>> flink@192.168.3.144:6123
>> ] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://
>> flink@192.168.3.144:6123
>> ]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>>
>>
>> 2022-06-19 22:14:45,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://
>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0
>> , retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://
>> flink@192.168.3.144:6123/user/rpc/resourcemanager_0.
>>
>> 2022-06-
>>
>>
>> Are there any additional definitions required for that?
>>
>>
>> thanks
>>
>> Sigalit
>>
>> On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <da...@gmail.com> wrote:
>>
>>> Could you please have a try with high availability enabled[1]?
>>>
>>> If HA enabled, the internal jobmanager rpc service will not be created.
>>> Instead, the TaskManager retrieves the JobManager address via HA services
>>> and connects to it via pod ip.
>>>
>>> [1].
>>> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>
>>> 于2022年6月16日周四 15:24写道:
>>>
>>>> Hello,
>>>>
>>>>
>>>>
>>>> We are launching Flink deployments using the Flink Kubernetes Operator
>>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>>>> on a Kubernetes cluster with Istio and mTLS enabled.
>>>>
>>>>
>>>>
>>>> We found that the TaskManager is unable to communicate with the
>>>> JobManager on the jobmanager-rpc port:
>>>>
>>>>
>>>>
>>>> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor
>>>> [] - Association with remote system
>>>> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
>>>> Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>>
>>>>
>>>>
>>>> The reason for the issue is that the JobManager service port
>>>> definitions are not following the Istio guidelines
>>>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>>>> (see example below).
>>>>
>>>>
>>>>
>>>> We believe a change to the default port definitions is needed but for
>>>> now, is there an immediate action we can take to work around the issue?
>>>> Perhaps overriding the default port definitions somehow?
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> flink-kubernetes-operator 1.0.0
>>>>
>>>> Flink 1.14-java11
>>>>
>>>> Kubernetes v1.19.5
>>>>
>>>> Istio 1.7.6
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> # k get service inference-results-to-analytics-engine -o yaml
>>>>
>>>> apiVersion: v1
>>>>
>>>> kind: Service
>>>>
>>>> metadata:
>>>>
>>>> ...
>>>>
>>>> labels:
>>>>
>>>> app: inference-results-to-analytics-engine
>>>>
>>>> type: flink-native-kubernetes
>>>>
>>>> name: inference-results-to-analytics-engine
>>>>
>>>> spec:
>>>>
>>>> clusterIP: None
>>>>
>>>> ports:
>>>>
>>>> - name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol"
>>>> property
>>>>
>>>> port: 6123
>>>>
>>>> protocol: TCP
>>>>
>>>> targetPort: 6123
>>>>
>>>> - name: blobserver # should start with "tcp-" or add "appProtocol"
>>>> property
>>>>
>>>> port: 6124
>>>>
>>>> protocol: TCP
>>>>
>>>> targetPort: 6124
>>>>
>>>> selector:
>>>>
>>>> app: inference-results-to-analytics-engine
>>>>
>>>> component: jobmanager
>>>>
>>>> type: flink-native-kubernetes
>>>>
>>>> sessionAffinity: None
>>>>
>>>> type: ClusterIP
>>>>
>>>> status:
>>>>
>>>> loadBalancer: {}
>>>>
>>>>
>>>>
>>>
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by Őrhidi Mátyás <ma...@gmail.com>.
It seems Istio must be configured to allow Akka cluster communication to
bypass the Istio sidecar proxy:
https://doc.akka.io/docs/akka-management/current/bootstrap/istio.html
On Mon, Jun 20, 2022 at 11:30 AM Sigalit Eliazov <e....@gmail.com>
wrote:
> Hi,
> we have enabled HA as suggested, the task manager tries to reach the job
> manager via pod id as expected but
> the task manager is unable to connect to the job manager:
>
>
> 2022-06-19 22:14:45,101 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Connecting to ResourceManager akka.tcp://
> flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)
> .
>
>
> 2022-06-19 22:14:45,242 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [/
> 192.168.3.144:6123
> ] failed with java.io.IOException: Connection reset by peer
>
>
> 2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://
> flink@192.168.3.144:6123
> ] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://
> flink@192.168.3.144:6123
> ]] Caused by: [The remote system explicitly disassociated (reason unknown).]
>
>
> 2022-06-19 22:14:45,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://
> flink@192.168.3.144:6123/user/rpc/resourcemanager_0
> , retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://
> flink@192.168.3.144:6123/user/rpc/resourcemanager_0.
>
> 2022-06-
>
>
> Are there any additional definitions required for that?
>
>
> thanks
>
> Sigalit
>
> On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <da...@gmail.com> wrote:
>
>> Could you please have a try with high availability enabled[1]?
>>
>> If HA enabled, the internal jobmanager rpc service will not be created.
>> Instead, the TaskManager retrieves the JobManager address via HA services
>> and connects to it via pod ip.
>>
>> [1].
>> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
>>
>>
>> Best,
>> Yang
>>
>> Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>
>> 于2022年6月16日周四 15:24写道:
>>
>>> Hello,
>>>
>>>
>>>
>>> We are launching Flink deployments using the Flink Kubernetes Operator
>>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>>> on a Kubernetes cluster with Istio and mTLS enabled.
>>>
>>>
>>>
>>> We found that the TaskManager is unable to communicate with the
>>> JobManager on the jobmanager-rpc port:
>>>
>>>
>>>
>>> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor
>>> [] - Association with remote system
>>> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
>>> Caused by: [The remote system explicitly disassociated (reason unknown).]
>>>
>>>
>>>
>>> The reason for the issue is that the JobManager service port definitions are
>>> not following the Istio guidelines
>>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>>> (see example below).
>>>
>>>
>>>
>>> We believe a change to the default port definitions is needed but for
>>> now, is there an immediate action we can take to work around the issue?
>>> Perhaps overriding the default port definitions somehow?
>>>
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>> flink-kubernetes-operator 1.0.0
>>>
>>> Flink 1.14-java11
>>>
>>> Kubernetes v1.19.5
>>>
>>> Istio 1.7.6
>>>
>>>
>>>
>>>
>>>
>>> # k get service inference-results-to-analytics-engine -o yaml
>>>
>>> apiVersion: v1
>>>
>>> kind: Service
>>>
>>> metadata:
>>>
>>> ...
>>>
>>> labels:
>>>
>>> app: inference-results-to-analytics-engine
>>>
>>> type: flink-native-kubernetes
>>>
>>> name: inference-results-to-analytics-engine
>>>
>>> spec:
>>>
>>> clusterIP: None
>>>
>>> ports:
>>>
>>> - name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol"
>>> property
>>>
>>> port: 6123
>>>
>>> protocol: TCP
>>>
>>> targetPort: 6123
>>>
>>> - name: blobserver # should start with "tcp-" or add "appProtocol"
>>> property
>>>
>>> port: 6124
>>>
>>> protocol: TCP
>>>
>>> targetPort: 6124
>>>
>>> selector:
>>>
>>> app: inference-results-to-analytics-engine
>>>
>>> component: jobmanager
>>>
>>> type: flink-native-kubernetes
>>>
>>> sessionAffinity: None
>>>
>>> type: ClusterIP
>>>
>>> status:
>>>
>>> loadBalancer: {}
>>>
>>>
>>>
>>
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by Sigalit Eliazov <e....@gmail.com>.
Hi,
we have enabled HA as suggested, the task manager tries to reach the job
manager via pod id as expected but
the task manager is unable to connect to the job manager:
2022-06-19 22:14:45,101 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Connecting to ResourceManager akka.tcp://
flink@192.168.3.144:6123/user/rpc/resourcemanager_0(8a98fdb734615089485c685afb0f402d)
.
2022-06-19 22:14:45,242 WARN
akka.remote.transport.netty.NettyTransport [] -
Remote connection to [/
192.168.3.144:6123
] failed with java.io.IOException: Connection reset by peer
2022-06-19 22:14:45,249 WARN akka.remote.ReliableDeliverySupervisor
[] - Association with remote system [akka.tcp://
flink@192.168.3.144:6123
] has failed, address is now gated for [50] ms. Reason: [Association
failed with [akka.tcp://
flink@192.168.3.144:6123
]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2022-06-19 22:14:45,255 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Could not resolve ResourceManager address akka.tcp://
flink@192.168.3.144:6123/user/rpc/resourcemanager_0
, retrying in 10000 ms: Could not connect to rpc endpoint under
address akka.tcp://
flink@192.168.3.144:6123/user/rpc/resourcemanager_0.
2022-06-
Are there any additional definitions required for that?
thanks
Sigalit
On Thu, Jun 16, 2022 at 2:28 PM Yang Wang <da...@gmail.com> wrote:
> Could you please have a try with high availability enabled[1]?
>
> If HA enabled, the internal jobmanager rpc service will not be created.
> Instead, the TaskManager retrieves the JobManager address via HA services
> and connects to it via pod ip.
>
> [1].
> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
>
>
> Best,
> Yang
>
> Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com>
> 于2022年6月16日周四 15:24写道:
>
>> Hello,
>>
>>
>>
>> We are launching Flink deployments using the Flink Kubernetes Operator
>> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
>> on a Kubernetes cluster with Istio and mTLS enabled.
>>
>>
>>
>> We found that the TaskManager is unable to communicate with the
>> JobManager on the jobmanager-rpc port:
>>
>>
>>
>> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor
>> [] - Association with remote system
>> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
>> Caused by: [The remote system explicitly disassociated (reason unknown).]
>>
>>
>>
>> The reason for the issue is that the JobManager service port definitions are
>> not following the Istio guidelines
>> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
>> (see example below).
>>
>>
>>
>> We believe a change to the default port definitions is needed but for
>> now, is there an immediate action we can take to work around the issue?
>> Perhaps overriding the default port definitions somehow?
>>
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>> flink-kubernetes-operator 1.0.0
>>
>> Flink 1.14-java11
>>
>> Kubernetes v1.19.5
>>
>> Istio 1.7.6
>>
>>
>>
>>
>>
>> # k get service inference-results-to-analytics-engine -o yaml
>>
>> apiVersion: v1
>>
>> kind: Service
>>
>> metadata:
>>
>> ...
>>
>> labels:
>>
>> app: inference-results-to-analytics-engine
>>
>> type: flink-native-kubernetes
>>
>> name: inference-results-to-analytics-engine
>>
>> spec:
>>
>> clusterIP: None
>>
>> ports:
>>
>> - name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol"
>> property
>>
>> port: 6123
>>
>> protocol: TCP
>>
>> targetPort: 6123
>>
>> - name: blobserver # should start with "tcp-" or add "appProtocol"
>> property
>>
>> port: 6124
>>
>> protocol: TCP
>>
>> targetPort: 6124
>>
>> selector:
>>
>> app: inference-results-to-analytics-engine
>>
>> component: jobmanager
>>
>> type: flink-native-kubernetes
>>
>> sessionAffinity: None
>>
>> type: ClusterIP
>>
>> status:
>>
>> loadBalancer: {}
>>
>>
>>
>
Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions
Posted by Yang Wang <da...@gmail.com>.
Could you please have a try with high availability enabled[1]?
If HA enabled, the internal jobmanager rpc service will not be created.
Instead, the TaskManager retrieves the JobManager address via HA services
and connects to it via pod ip.
[1].
https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml
Best,
Yang
Elisha, Moshe (Nokia - IL/Kfar Sava) <mo...@nokia.com> 于2022年6月16日周四
15:24写道:
> Hello,
>
>
>
> We are launching Flink deployments using the Flink Kubernetes Operator
> <https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-stable/>
> on a Kubernetes cluster with Istio and mTLS enabled.
>
>
>
> We found that the TaskManager is unable to communicate with the JobManager
> on the jobmanager-rpc port:
>
>
>
> 2022-06-15 15:25:40,508 WARN akka.remote.ReliableDeliverySupervisor
> [] - Association with remote system
> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
> Caused by: [The remote system explicitly disassociated (reason unknown).]
>
>
>
> The reason for the issue is that the JobManager service port definitions are
> not following the Istio guidelines
> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
> (see example below).
>
>
>
> We believe a change to the default port definitions is needed but for now,
> is there an immediate action we can take to work around the issue? Perhaps
> overriding the default port definitions somehow?
>
>
>
> Thanks.
>
>
>
>
>
> flink-kubernetes-operator 1.0.0
>
> Flink 1.14-java11
>
> Kubernetes v1.19.5
>
> Istio 1.7.6
>
>
>
>
>
> # k get service inference-results-to-analytics-engine -o yaml
>
> apiVersion: v1
>
> kind: Service
>
> metadata:
>
> ...
>
> labels:
>
> app: inference-results-to-analytics-engine
>
> type: flink-native-kubernetes
>
> name: inference-results-to-analytics-engine
>
> spec:
>
> clusterIP: None
>
> ports:
>
> - name: jobmanager-rpc # should start with “tcp-“ or add "appProtocol"
> property
>
> port: 6123
>
> protocol: TCP
>
> targetPort: 6123
>
> - name: blobserver # should start with "tcp-" or add "appProtocol"
> property
>
> port: 6124
>
> protocol: TCP
>
> targetPort: 6124
>
> selector:
>
> app: inference-results-to-analytics-engine
>
> component: jobmanager
>
> type: flink-native-kubernetes
>
> sessionAffinity: None
>
> type: ClusterIP
>
> status:
>
> loadBalancer: {}
>
>
>