Posted to user@flink.apache.org by Fuyao Li <fu...@oracle.com> on 2021/03/26 21:59:22 UTC

Need help with executing Flink CLI for native Kubernetes deployment

Hi Community, Yang,

I am new to Flink on native Kubernetes and I am trying to do a POC for native Kubernetes application mode on Oracle Cloud Infrastructure. I was following the documentation here step by step: [1]

I am using Flink 1.12.1, Scala 2.11, and Java 11.
I was able to create a native Kubernetes deployment, but I cannot run any further commands such as list or cancel; they always fail with a timeout error. I think the issue could be that the JobManager Web Interface IP address printed after job deployment is not accessible. As a result, I am not able to shut down the deployment with a savepoint. It could be a Kubernetes configuration issue. I have opened traffic on all related ports and validated the security list, but I still couldn’t make it work. Any help is appreciated.


The relevant Flink source code is the CliFrontend.java class [2].
The ./bin/flink list and cancel commands try to send traffic to the Flink dashboard UI IP address and time out. I tried both the LoadBalancer and NodePort options for the -Dkubernetes.rest-service.exposed.type configuration. Neither of them works.

# List running jobs on the cluster (I can’t execute this command successfully due to a timeout; logs are shared below)
$ ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
# Cancel a running job (I can’t execute this command successfully)
$ ./bin/flink cancel --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster <jobId>

I think those commands need to communicate with the endpoint that is printed after the job submission command.


  1.  Use case 1 (deploy with NodePort)

# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=NodePort \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


When the exposed type is NodePort, the printed message says the Flink JobManager Web Interface is at http://192.29.104.156:30996. 192.29.104.156 is my Kubernetes API server address, and 30996 is the port that exposes the service. However, the Flink dashboard is not reachable at this address.
I can only access the dashboard UI on each node IP address (there are three nodes in my K8s cluster):
100.104.154.73:30996
100.104.154.74:30996
100.104.154.75:30996
I got errors when running the list command against this native Kubernetes deployment; see [4]. According to the documentation [3], this shouldn’t happen, since the Kubernetes API server address should also serve the Flink Web UI. Did I miss any Kubernetes configuration needed to make the Web UI available at the API server address?
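
One way to cross-check the NodePort case is to ask Kubernetes for the node addresses and the assigned port, then probe each node directly. The following is only a minimal sketch under stated assumptions: the REST service is assumed to be named <cluster-id>-rest in the current namespace, kubectl and curl are assumed to be installed, and the helper names rest_urls and probe_nodeport are hypothetical.

```shell
# Sketch: build http://<node-ip>:<node-port> URLs for a NodePort REST service.
# Assumption: the service is called "<cluster-id>-rest" in the current namespace.

# rest_urls PORT IP... : print one candidate URL per node IP (pure string work).
rest_urls() {
  local port="$1"; shift
  local ip
  for ip in "$@"; do
    printf 'http://%s:%s\n' "$ip" "$port"
  done
}

# probe_nodeport CLUSTER_ID : look up the NodePort and node IPs, then curl each URL.
probe_nodeport() {
  local cluster_id="$1" port ips url
  port="$(kubectl get service "${cluster_id}-rest" \
            -o jsonpath='{.spec.ports[0].nodePort}')" || return 1
  ips="$(kubectl get nodes \
            -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')" || return 1
  # Word splitting of $ips is intentional: jsonpath returns space-separated IPs.
  for url in $(rest_urls "$port" $ips); do
    if curl --silent --fail --max-time 5 --output /dev/null "${url}/overview"; then
      echo "reachable:   ${url}"
    else
      echo "unreachable: ${url}"
    fi
  done
}
```

Running probe_nodeport my-first-application-cluster from the machine that hosts the Flink client shows which node addresses, if any, that machine can actually reach.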



  2.  Use case 2 (deploy with LoadBalancer)
# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=LoadBalancer \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


After a while, once the external IP was resolved, the client reported that the Flink JobManager web interface is at the external IP (the load balancer address): http://144.25.13.78:8081
When I execute the list command, I still get an error after waiting a long time for it to time out. See the errors in [5].

I can still access NodeIP:<service-port>. Because of that, I tend to believe it is a network issue, but I am still quite confused since I have already opened all the traffic.




References:
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[2] https://github.com/apache/flink/blob/f3155e6c0213de7bf4b58a89fb1e1331dee7701a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html#accessing-flinks-web-ui
[4] https://pastebin.ubuntu.com/p/WcJMwds52r/
[5] https://pastebin.ubuntu.com/p/m27BnQGXQc/


Thanks for your help in advance.

Best regards,
Fuyao



Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Fuyao Li <fu...@oracle.com>.
Hi Yang,

Thanks for the reply; that information is very helpful.

Best,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Tuesday, April 6, 2021 at 01:11
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Sorry for the late reply.

It is not very hard to develop your own deployer. Actually, it took me 3 days to develop the PoC version of flink-native-k8s-operator. So
if you want a fully functional K8s operator, maybe two weeks is enough. But if you want to put it into production, you may need
some more time to polish it for easier use.

The Flink native K8s integration is not going to replace standalone mode. First, not all Flink standalone clusters run on K8s.
Second, standalone mode can work really well with reactive mode [1].


The Flink native K8s integration is not going to replace the K8s operator either. Actually, the Flink K8s operator is not on the same level as the Flink native
integration. The Flink K8s operator is responsible for managing the lifecycle of a Flink application. It also makes job submission more K8s-style.
The Google and Lyft Flink K8s operators could support native mode; they just do not have that support right now.


Kubernetes HA works for both standalone mode and native mode. You can find the configuration here [2]. However, you might
need some changes to the Flink K8s operator to make it work, because we need to add more args (e.g. --host) to the JobManager start commands.


[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
[2]. https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#high-availability-with-standalone-kubernetes

Best,
Yang


Fuyao Li <fu...@oracle.com> wrote on Monday, April 5, 2021 at 1:33 PM:
Hello Yang,

I am just following up on my previous email to see if you have had some time to reply.
I also took a deeper look into the Lyft K8s operator recently. It seems it doesn’t support HA natively; it still needs ZooKeeper. In this respect, native K8s is better. Any other ideas? Thanks for your help.

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Thursday, April 1, 2021 at 12:22
To: Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Yang,

Thanks for sharing the insights.

For problem 1:
I think I can’t use telnet in the container. I tried curl 144.25.13.78:8081 instead and could see the HTML of the Flink dashboard UI. This shows that the public IP is reachable from inside the cluster. As you mentioned, there might still be some network issues with the cluster. I will check further.

For problem 2:
I created a new K8s cluster with a bastion server that has a public IP assigned to it. Finally, I can see something valid in my browser. (There are still some problems connecting to some databases, but I think these network problems are not directly related to Flink; I can investigate them later.)

For problem 3:
Thanks for sharing the repo you created. I am not sure how much work it would take to develop a deployer. I understand it depends on proficiency, but could you give a rough estimate? If it is too complicated and other options are not significantly inferior to native Kubernetes, I might prefer one of those. I am currently comparing different options for deploying on Kubernetes:

  1.  Standalone K8S
  2.  Native Kubernetes
  3.  Flink operator (Google Cloud Platform/ Lyft) [1][2]

I also watched the demo video you presented [3]. I noticed you mentioned that native K8s is not going to replace the other two options. I still don’t fully understand your point from the limited explanation in the demo. Could you compare the tradeoffs a little? Thanks!
[1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
[2] https://github.com/lyft/flinkk8soperator
[3] https://youtu.be/pdFPr_VOWTU

Best,
Fuyao


From: Yang Wang <da...@gmail.com>
Date: Tuesday, March 30, 2021 at 19:15
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for sharing the progress.

1. The flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping that address?

I think this is an environment problem. Actually, not every IP address can be tested with the "ping" command. I suggest you use "telnet 144.25.13.78 8081" to check the network connectivity.

2. Why is 144.25.13.78:8081 not accessible from outside, that is, from my laptop’s browser? I am within the company’s VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?

Just like my answer above, I think you need to check the network connectivity via "telnet 144.25.13.78 8081". Maybe the firewall does not allow connections from your local machine (e.g. your local IP is not in the allow list of the LoadBalancer).
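
For reference, "ping" uses ICMP, which load balancers and security lists often drop even when a TCP port is open, so a failed ping does not say much about port 8081. Where telnet is not installed, a TCP-level check can be sketched with bash alone. This is only an illustration: the function name check_tcp is hypothetical, and it assumes bash (with /dev/tcp support) and the coreutils "timeout" command are available.

```shell
# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# Uses bash's built-in /dev/tcp redirection, so neither telnet nor ping is needed.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # Inside the inner bash, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Example: check_tcp 144.25.13.78 8081 && echo "port open" || echo "port closed or filtered"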

3. In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can I automate this process and integrate it into CI/CD? Please share some blogs if there are any, thanks.

I think in a production environment you should have your own deployer, which takes care of submitting, listing, and cancelling jobs. The deployer could even help with triggering savepoints and managing the whole lifecycle of Flink applications. I once developed a PoC of flink-native-k8s-operator [1]. It could be a starting point for your own deployer if you want to develop it in Java.

[1]. https://github.com/wangyang0918/flink-native-k8s-operator
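
As a rough illustration of what a small deployer or CI/CD step might call, the JobManager REST API can be driven with curl once the endpoint is reachable. This is a sketch under assumptions, not the operator's implementation: FLINK_REST defaults to the load balancer URL from this thread, the function names are hypothetical, and the endpoints used (GET /jobs/overview, PATCH /jobs/:jobid with mode=cancel, POST /jobs/:jobid/savepoints) are the ones described in the Flink REST API documentation.

```shell
# Sketch of job management through the JobManager REST API with curl.
# FLINK_REST is an assumption: the base URL printed by the Flink client.
FLINK_REST="${FLINK_REST:-http://144.25.13.78:8081}"

# Pure helpers that only build URLs and request bodies.
jobs_overview_url() { printf '%s/jobs/overview\n' "$FLINK_REST"; }
cancel_url()        { printf '%s/jobs/%s?mode=cancel\n' "$FLINK_REST" "$1"; }
savepoint_url()     { printf '%s/jobs/%s/savepoints\n' "$FLINK_REST" "$1"; }
savepoint_body()    { printf '{"target-directory": "%s", "cancel-job": %s}\n' "$1" "$2"; }

# list_jobs : print the JSON overview of all jobs.
list_jobs()  { curl --silent --fail "$(jobs_overview_url)"; }

# cancel_job JOB_ID : cancel a job without taking a savepoint.
cancel_job() { curl --silent --fail -X PATCH "$(cancel_url "$1")"; }

# cancel_with_savepoint JOB_ID TARGET_DIR : trigger a savepoint and cancel the job.
# The call is asynchronous; the response carries a request id to poll.
cancel_with_savepoint() {
  curl --silent --fail -X POST -H 'Content-Type: application/json' \
       --data "$(savepoint_body "$2" true)" "$(savepoint_url "$1")"
}
```

The savepoint target directory passed to cancel_with_savepoint must be a location the JobManager and TaskManagers can write to; any path used with it here would be a placeholder.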


Best,
Yang

Fuyao Li <fu...@oracle.com> wrote on Wednesday, March 31, 2021 at 6:37 AM:
Hello Yang,

Thank you so much for providing the flink-client.yaml. I was able to make some progress. I didn’t realize I should create a new flink-client pod to list/cancel jobs. I was trying to do this from my local laptop; maybe that is why it didn’t work. However, I still have several questions.

I created the deployment based on your flink-client.yaml.
For the LoadBalancer mode:

First, I applied the ClusterRole YAML below.

# https://kubernetes.io/docs/reference/access-authn-authz/rbac/
# https://stackoverflow.com/questions/47973570/kubernetes-log-user-systemserviceaccountdefaultdefault-cannot-get-services
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: service-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Then I executed this command:
kubectl create clusterrolebinding service-reader-pod  --clusterrole=service-reader  --serviceaccount=default:default
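
Whether the binding took effect can be checked with "kubectl auth can-i" while impersonating the service account. The snippet below is a small sketch; the helper names sa_user and check_service_access are hypothetical, and it assumes the caller is allowed to impersonate service accounts.

```shell
# sa_user NAMESPACE SERVICE_ACCOUNT : user name Kubernetes uses for a service account.
sa_user() { printf 'system:serviceaccount:%s:%s\n' "$1" "$2"; }

# check_service_access NAMESPACE SERVICE_ACCOUNT
# Prints yes/no for each verb the Flink client needs on Services.
check_service_access() {
  local ns="$1" sa="$2" verb
  for verb in get list watch; do
    printf '%-6s services: ' "$verb"
    kubectl auth can-i "$verb" services \
      --as="$(sa_user "$ns" "$sa")" --namespace "$ns"
  done
}
```

For the binding above, the call would be check_service_access default default.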

After that, I am able to exec into the flink-client pod and list/cancel jobs.

$ kubectl exec -it flink-client-776886cf4f-9h47f bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@flink-client-776886cf4f-9h47f:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
2021-03-30 21:53:14,513 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster my-first-application-cluster successfully, JobManager Web Interface: http://144.25.13.78:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
24.03.2021 00:13:04 : eea39629a1931b67eb395207739455ce : Flink Streaming Java API Skeleton (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client-776886cf4f-9h47f:/opt/flink# ping 144.25.13.78
PING 144.25.13.78 (144.25.13.78) 56(84) bytes of data.

^C
--- 144.25.13.78 ping statistics ---
31 packets transmitted, 0 received, 100% packet loss, time 772ms
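
The deprecation warning in the log above goes away when the command is separated with "--", and the same call can be made non-interactive so that it fits into a script. A minimal sketch follows, assuming the Deployment is named flink-client as in the YAML from this thread and that Flink is installed under /opt/flink in the image; the wrapper name flink_in_client and the DRY_RUN switch are hypothetical.

```shell
# flink_in_client ARGS... : run the Flink CLI inside the flink-client Deployment.
# Set DRY_RUN=1 to print the kubectl command instead of executing it.
flink_in_client() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "kubectl exec deploy/flink-client -- /opt/flink/bin/flink $*"
  else
    kubectl exec deploy/flink-client -- /opt/flink/bin/flink "$@"
  fi
}
```

Example: flink_in_client list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster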

Questions:

  1.  The flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping that address?
  2.  Why is 144.25.13.78:8081 not accessible from outside, that is, from my laptop’s browser? I am within the company’s VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?
  3.  In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can I automate this process and integrate it into CI/CD? Please share some blogs if there are any, thanks.


Best,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Monday, March 29, 2021 at 20:40
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for trying the native Kubernetes integration.

As you know, the Flink REST service can be exposed in the following three ways, configured via "kubernetes.rest-service.exposed.type".

* ClusterIP, which means you can only access the Flink REST endpoint from inside the K8s cluster. Users can simply start a Flink client in the
K8s cluster via the following YAML file, and then use "kubectl exec" to tunnel into the pod and create a Flink session/application cluster. The
"flink list/cancel" commands also work from there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-client
  template:
    metadata:
      labels:
        app: flink-client
    spec:
      containers:
      - name: client
        image: flink:1.12.2
        imagePullPolicy: Always
        args: ["sleep", "86400"]
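
With ClusterIP, another option from a machine that has kubectl access is to forward the REST port locally and query the REST API through the tunnel. The following sketch rests on assumptions: the REST service is assumed to be named <cluster-id>-rest and to listen on port 8081, and the helper names are hypothetical.

```shell
# rest_service_name CLUSTER_ID : name of the REST Service (assumed "<cluster-id>-rest").
rest_service_name() { printf '%s-rest\n' "$1"; }

# list_via_port_forward CLUSTER_ID [LOCAL_PORT]
# Forwards the REST port, queries /jobs/overview, then closes the tunnel.
list_via_port_forward() {
  local cluster_id="$1" local_port="${2:-8081}" pf_pid status
  kubectl port-forward "service/$(rest_service_name "$cluster_id")" \
    "${local_port}:8081" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # crude wait for the tunnel; a retry loop would be more robust
  curl --silent --fail "http://localhost:${local_port}/jobs/overview"
  status=$?
  kill "$pf_pid" 2>/dev/null
  return "$status"
}
```

Example: list_via_port_forward my-first-application-cluster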

* NodePort
Currently, we have a limitation that only the Kubernetes master nodes can be used to build the exposed Flink REST endpoint. So if your
APIServer node does not run kube-proxy, the URL printed in the Flink client logs cannot be used. We already have a ticket [1] to
support using one of the worker nodes to access the REST endpoint, but I have not managed to get it done yet.

* LoadBalancer
Is the resolved REST endpoint "http://144.25.13.78:8081/" accessible from your Flink client side? If yes, then I think the Flink client
should be able to contact the JobManager REST server to list/cancel the jobs. I have verified this on Alibaba container service, and it works well.


[1]. https://issues.apache.org/jira/browse/FLINK-16601


Best,
Yang



Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Yang Wang <da...@gmail.com>.
Hi Fuyao,

Sorry for the late reply.

It is not very hard to develop your own deployer. Actually, it took me 3
days to develop the PoC version of flink-native-k8s-operator. So
if you want a fully functional K8s operator, maybe two weeks is
enough. But if you want to put it into production, you may need
some more time to polish it for easier use.

The Flink native K8s integration is not going to replace standalone mode.
First, not all Flink standalone clusters run on K8s.
Second, standalone mode can work really well with reactive mode [1].


The Flink native K8s integration is not going to replace the K8s operator
either. Actually, the Flink K8s operator is not on the same level as the
Flink native integration. The Flink K8s operator is responsible for managing
the lifecycle of a Flink application. It also makes job submission more
K8s-style.
The Google and Lyft Flink K8s operators could support native mode; they just
do not have that support right now.


Kubernetes HA works for both standalone mode and native mode. You
can find the configuration here [2]. However, you might
need some changes to the Flink K8s operator to make it work, because we
need to add more args (e.g. --host) to the JobManager start commands.


[1].
https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
[2].
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#high-availability-with-standalone-kubernetes

Best,
Yang


Fuyao Li <fu...@oracle.com> 于2021年4月5日周一 下午1:33写道:

> Hello Yang,
>
>
>
> I am just following up the previous email to see if you got some time to
> reply.
>
> I also took a deeper look into lyft k8s operator recently. It seems it
> doesn’t support HA natively. It still needs the help of ZooKeeper. In terms
> of this, native k8s is better. Any other ideas? Thanks for your help.
>
>
>
> Best,
>
> Fuyao
>
>
>
> *From: *Fuyao Li <fu...@oracle.com>
> *Date: *Thursday, April 1, 2021 at 12:22
> *To: *Yang Wang <da...@gmail.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *Re: [External] : Re: Need help with executing Flink CLI for
> native Kubernetes deployment
>
> Hi Yang,
>
>
>
> Thanks for sharing the insights.
>
>
>
> For problem 1:
>
> I think I can’t do telnet in the container. I tried to use curl
> 144.25.13.78:8081 and I could see the HTML of Flink dashboard UI. This
> proves such public IP is reachable inside the cluster. Just as you
> mentioned, there might still be some network issues with the cluster. I
> will do some further check.
>
>
>
> For problem 2:
>
> I created a new K8S cluster with bastion server with some public IP
> assigned to it. Finally, I can see something valid from my browser. (There
> still exist some problems with connecting to some databases, but I think
> these network problems are not directly related to Flink, I can investigate
> into it later.)
>
>
>
> For problem 3:
>
> Thanks for sharing the repo you created. I am not sure how much work it
> could take to develop a deployer. I understand is depends on the
> proficiency, could you give a rough estimation? If it is too complicated
> and some other options are not significantly inferior to native Kubernetes.
> I might prefer to choose other options. I am currently comparing different
> options to deploy in Kubernetes.
>
>    1. Standalone K8S
>    2. Native Kubernetes
>    3. Flink operator (Google Cloud Platform/ Lyft) [1][2]
>
>
>
> I also watched the demo video you presented. [3] I noticed you mentioned
> that native K8S is not going to replace the other two options. I still
> doesn’t fully get your idea with limited explanation in the demo. Could you
> compare the tradeoff a little bit? Thanks!
>
> [1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
> <https://urldefense.com/v3/__https:/github.com/GoogleCloudPlatform/flink-on-k8s-operator__;!!GqivPVa7Brio!PSI2zQyxn7aqBiBu0QwpyZPQfjtARF7Q_rImBKFsEo_qREmnZ7nae9sY_v6kdJg$>
>
> [2]  https://github.com/lyft/flinkk8soperator
> <https://urldefense.com/v3/__https:/github.com/lyft/flinkk8soperator__;!!GqivPVa7Brio!PSI2zQyxn7aqBiBu0QwpyZPQfjtARF7Q_rImBKFsEo_qREmnZ7nae9sYbe30Oc8$>
>
> [3] https://youtu.be/pdFPr_VOWTU
> <https://urldefense.com/v3/__https:/youtu.be/pdFPr_VOWTU__;!!GqivPVa7Brio!PSI2zQyxn7aqBiBu0QwpyZPQfjtARF7Q_rImBKFsEo_qREmnZ7nae9sY-or9EkA$>
>
>
>
> Best,
>
> Fuyao
>
>
>
>
>
> *From: *Yang Wang <da...@gmail.com>
> *Date: *Tuesday, March 30, 2021 at 19:15
> *To: *Fuyao Li <fu...@oracle.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *Re: [External] : Re: Need help with executing Flink CLI for
> native Kubernetes deployment
>

Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Fuyao Li <fu...@oracle.com>.
Hello Yang,

I am following up on my previous email to see whether you have had time to reply.
I also took a deeper look at the Lyft K8s operator recently. It does not seem to support HA natively; it still relies on ZooKeeper. In this respect, native K8s is better. Any other ideas? Thanks for your help.

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Thursday, April 1, 2021 at 12:22
To: Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Yang,

Thanks for sharing the insights.

For problem 1:
I don’t think I can use telnet in the container, so I tried curl 144.25.13.78:8081 instead and could see the HTML of the Flink dashboard UI. This shows that the public IP is reachable from inside the cluster. As you mentioned, there might still be some network issues with the cluster. I will check further.

For problem 2:
I created a new K8S cluster with a bastion server, with a public IP assigned to it. Now I can finally see a valid page in my browser. (There are still some problems connecting to some databases, but I think these network problems are not directly related to Flink; I can investigate them later.)

For problem 3:
Thanks for sharing the repo you created. I am not sure how much work it would take to develop a deployer. I understand it depends on proficiency, but could you give a rough estimate? If it is too complicated and the other options are not significantly inferior to native Kubernetes, I might prefer one of them. I am currently comparing different options for deploying on Kubernetes:

  1.  Standalone K8S
  2.  Native Kubernetes
  3.  Flink operator (Google Cloud Platform/ Lyft) [1][2]

I also watched the demo video you presented [3]. I noticed you mentioned that native K8S is not going to replace the other two options. I still don’t fully understand your point, given the limited explanation in the demo. Could you compare the tradeoffs a bit? Thanks!
[1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
[2] https://github.com/lyft/flinkk8soperator
[3] https://youtu.be/pdFPr_VOWTU

Best,
Fuyao


From: Yang Wang <da...@gmail.com>
Date: Tuesday, March 30, 2021 at 19:15
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for sharing the progress.

1. The Flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping this address?

I think this is an environment problem. Not every IP address can be tested with the "ping" command. I suggest using "telnet 144.25.13.78 8081" to check the network connectivity.

2. Why is 144.25.13.78:8081 not accessible from outside, I mean from my laptop’s browser? I am within the company’s VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?

As in my answer above, I think you need to check the network connectivity via "telnet 144.25.13.78 8081". Maybe the firewall does not allow connections from your local machine (e.g. your local IP is not in the whitelist of the LoadBalancer).
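
A likely explanation for the failed ping is that ping sends ICMP echo requests, which load balancers often do not answer, while the Flink REST endpoint only needs TCP port 8081. Where telnet is not installed, a TCP check can be sketched with bash alone. This is a minimal sketch, assuming bash (with /dev/tcp support) and the coreutils "timeout" command are available; the function name check_tcp is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port accepts connections, without telnet or ping.
# Relies on bash's built-in /dev/tcp redirection and the coreutils `timeout` command.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # The inner bash tries to open a TCP connection on file descriptor 3.
  # A refused or timed-out connection makes the command exit non-zero.
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable: ${host}:${port}"
  else
    echo "unreachable: ${host}:${port}"
    return 1
  fi
}

# Example with the address from this thread:
# check_tcp 144.25.13.78 8081
```

Run from the laptop and from inside the cluster, the same call shows on which side of the firewall the connection is being dropped.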

3. In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can this process be automated and integrated into CI/CD? Please share some blog posts if there are any, thanks.

I think in a production environment you should have your own deployer, which takes care of submitting, listing, and cancelling jobs. The deployer could even help with triggering savepoints and managing the whole lifecycle of Flink applications. I once developed a PoC of native-flink-k8s-operator[1]. It could be a starting point for your own deployer if you want to develop it in Java.

[1]. https://github.com/wangyang0918/flink-native-k8s-operator
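
Short of a full deployer, one way to script list/cancel without "kubectl exec" is to call the JobManager REST API directly, which is the same endpoint the Flink CLI talks to. The sketch below wraps three REST endpoints with curl. FLINK_REST, the CURL override, and the function names are illustrative assumptions, not part of Flink, and the script only works from a machine that can actually reach the endpoint.

```shell
#!/usr/bin/env bash
# Sketch of a scripted client for the JobManager REST API.
# FLINK_REST is an assumption: set it to an address reachable from where this runs.
FLINK_REST="${FLINK_REST:-http://144.25.13.78:8081}"
# CURL can be overridden (e.g. CURL=echo) to print the request instead of sending it.
CURL="${CURL:-curl}"

# List all jobs with their status (GET /jobs/overview).
list_jobs() {
  "$CURL" -sS --max-time 10 "${FLINK_REST}/jobs/overview"
}

# Cancel a job without taking a savepoint (PATCH /jobs/<jobId>?mode=cancel).
cancel_job() {
  "$CURL" -sS --max-time 10 -X PATCH "${FLINK_REST}/jobs/$1?mode=cancel"
}

# Trigger a savepoint and cancel the job once it completes
# (POST /jobs/<jobId>/savepoints). The call is asynchronous and the response
# carries a request id that can be polled for the result.
cancel_with_savepoint() {
  local job_id="$1" target_dir="$2"
  "$CURL" -sS --max-time 10 -X POST \
    -H 'Content-Type: application/json' \
    -d "{\"target-directory\": \"${target_dir}\", \"cancel-job\": true}" \
    "${FLINK_REST}/jobs/${job_id}/savepoints"
}

# Example with the job id from this thread:
# list_jobs
# cancel_job eea39629a1931b67eb395207739455ce
```

In a CI/CD pipeline these calls would typically run from a job inside the cluster network, so the endpoint does not have to be exposed publicly.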


Best,
Yang

On Wed, Mar 31, 2021 at 6:37 AM, Fuyao Li <fu...@oracle.com> wrote:
Hello Yang,

Thank you so much for providing the flink-client.yaml. I was able to make some progress. I didn’t realize I should create a new flink-client pod to list/cancel jobs. I was trying to do this from my local laptop, which may be why it didn’t work. However, I still have several questions.

I created the deployment based on your flink-client.yaml.
For the LoadBalancer mode:

After applying the ClusterRole YAML below:

# https://kubernetes.io/docs/reference/access-authn-authz/rbac/
# https://stackoverflow.com/questions/47973570/kubernetes-log-user-systemserviceaccountdefaultdefault-cannot-get-services
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: service-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

And execute the command:
kubectl create clusterrolebinding service-reader-pod  --clusterrole=service-reader  --serviceaccount=default:default
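
To confirm that the binding took effect before retrying the Flink CLI, kubectl can answer authorization questions on behalf of the service account. A small sketch, assuming your own kubectl user is allowed to impersonate service accounts (cluster admins usually are); the helper name can_sa and the KUBECTL override are made up for illustration.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Ask the API server whether a service account may perform a verb on Services.
# Usage: can_sa <verb> [service-account] [namespace]
can_sa() {
  local verb="$1" sa="${2:-default}" ns="${3:-default}"
  "$KUBECTL" auth can-i "$verb" services \
    --as="system:serviceaccount:${ns}:${sa}" --namespace "$ns"
}

# Example: check the verbs the Flink client needs to look up the REST service.
# can_sa get
# can_sa list
```

kubectl prints "yes" or "no" for each query, which makes a missing permission visible without reading the Flink client's stack trace.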

I am able to exec into the flink-client pod and list/cancel jobs.

$ kubectl exec -it flink-client-776886cf4f-9h47f bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@flink-client-776886cf4f-9h47f:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
2021-03-30 21:53:14,513 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster my-first-application-cluster successfully, JobManager Web Interface: http://144.25.13.78:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
24.03.2021 00:13:04 : eea39629a1931b67eb395207739455ce : Flink Streaming Java API Skeleton (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client-776886cf4f-9h47f:/opt/flink# ping 144.25.13.78
PING 144.25.13.78 (144.25.13.78) 56(84) bytes of data.

^C
--- 144.25.13.78 ping statistics ---
31 packets transmitted, 0 received, 100% packet loss, time 772ms
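
The deprecation warning in the log above goes away when the command is separated from the pod name with "--". The same form also allows running the Flink CLI non-interactively, which is easier to script than an interactive shell. A sketch, assuming the Deployment is named flink-client as in the YAML from this thread and that the container's working directory is /opt/flink (as the shell prompt above suggests); flink_in_pod and the KUBECTL override are illustrative names.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# Run the Flink CLI inside a pod of the flink-client Deployment.
# Addressing the Deployment avoids hard-coding the generated pod name.
flink_in_pod() {
  "$KUBECTL" exec deploy/flink-client -- ./bin/flink "$@"
}

# Interactive shell, non-deprecated form:
#   kubectl exec -it flink-client-776886cf4f-9h47f -- bash
# Non-interactive listing:
#   flink_in_pod list --target kubernetes-application \
#     -Dkubernetes.cluster-id=my-first-application-cluster
```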

Questions:

  1.  The Flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping this address?
  2.  Why is 144.25.13.78:8081 not accessible from outside, I mean from my laptop’s browser? I am within the company’s VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?
  3.  In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can this process be automated and integrated into CI/CD? Please share some blog posts if there are any, thanks.


Best,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Monday, March 29, 2021 at 20:40
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for trying the native Kubernetes integration.

As you know, the Flink REST service can be exposed in the following three ways, configured via "kubernetes.rest-service.exposed.type".

* ClusterIP, which means you can only access the Flink REST endpoint from inside the K8s cluster. Users can simply start a Flink client in the
K8s cluster via the following YAML file, then use "kubectl exec" to tunnel into the pod and create a Flink session/application cluster. The
"flink list/cancel" commands also work well from there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-client
  template:
    metadata:
      labels:
        app: flink-client
    spec:
      containers:
      - name: client
        image: flink:1.12.2
        imagePullPolicy: Always
        args: ["sleep", "86400"]

* NodePort
Currently, we have a limitation that only the Kubernetes master nodes can be used to build the exposed Flink REST endpoint. So if your
APIServer node does not run kube-proxy, the URL printed in the Flink client logs cannot be used. We already have a ticket[1] to
support using one of the worker nodes for accessing the REST endpoint, but I have not managed to get it done yet.

* LoadBalancer
Is the resolved REST endpoint "http://144.25.13.78:8081/" accessible from your Flink client side? If yes, then I think the Flink client
should be able to contact the JobManager REST server to list/cancel the jobs. I have verified this on Alibaba container service, and it works well.
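
When the printed URL is not usable, the actual endpoint can be looked up with kubectl, or the REST service can be forwarded to a local port regardless of the exposed type. The sketch below assumes that the REST Service is named "<cluster-id>-rest" and listens on port 8081; verify both with "kubectl get services" before relying on them. The function names and the KUBECTL and CLUSTER_ID variables are illustrative.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"
CLUSTER_ID="${CLUSTER_ID:-my-first-application-cluster}"

# Print the node port assigned to the REST service (NodePort mode).
# Assumption: the service is named <cluster-id>-rest.
rest_node_port() {
  "$KUBECTL" get service "${CLUSTER_ID}-rest" -o 'jsonpath={.spec.ports[0].nodePort}'
}

# Forward local port 8081 to the REST service. This goes through the
# Kubernetes API server, so it also works for ClusterIP.
forward_rest() {
  "$KUBECTL" port-forward "service/${CLUSTER_ID}-rest" 8081:8081
}

# Example:
#   kubectl get nodes -o wide          # pick a reachable node IP
#   curl "http://<node-ip>:$(rest_node_port)/jobs/overview"
# or, in a second terminal after running forward_rest:
#   curl http://localhost:8081/jobs/overview
```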


[1]. https://issues.apache.org/jira/browse/FLINK-16601


Best,
Yang




Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Fuyao Li <fu...@oracle.com>.
Hi Yang,

Thanks for sharing the insights.

For problem 1:
I think I can’t run telnet in the container. I tried curl 144.25.13.78:8081 instead and could see the HTML of the Flink dashboard UI, which shows that this public IP is reachable from inside the cluster. As you mentioned, there might still be some network issues with the cluster. I will check further.
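
For anyone else who hits the same "no telnet in the container" problem, here is a minimal sketch of a TCP port check that needs neither telnet nor ping. It assumes the container has bash (with /dev/tcp support) and the coreutils timeout command; check_tcp is just a hypothetical helper name.

```shell
# Check whether a TCP port accepts connections, without telnet or ping.
# Usage: check_tcp HOST PORT [TIMEOUT_SECONDS]
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
  # timeout aborts the attempt if the packets are silently dropped.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (address taken from the logs in this thread):
# check_tcp 144.25.13.78 8081 && echo reachable || echo unreachable
```

The function returns a non-zero status both when the connection is refused and when it times out, so it only tells you "reachable or not", not why.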

For problem 2:
I created a new K8S cluster with a bastion server that has a public IP assigned to it. Now I can finally see something valid in my browser. (There are still some problems connecting to some databases, but I think these network problems are not directly related to Flink; I can investigate them later.)

For problem 3:
Thanks for sharing the repo you created. I am not sure how much work it would take to develop a deployer. I understand it depends on proficiency, but could you give a rough estimate? If it is too complicated and the other options are not significantly inferior to native Kubernetes, I might prefer one of them. I am currently comparing different options for deploying on Kubernetes:

  1.  Standalone K8S
  2.  Native Kubernetes
  3.  Flink operator (Google Cloud Platform/ Lyft) [1][2]

I also watched the demo video you presented [3]. I noticed you mentioned that native K8S is not going to replace the other two options. I still don’t fully get your point from the limited explanation in the demo. Could you compare the tradeoffs a little? Thanks!
[1] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
[2] https://github.com/lyft/flinkk8soperator
[3] https://youtu.be/pdFPr_VOWTU

Best,
Fuyao


From: Yang Wang <da...@gmail.com>
Date: Tuesday, March 30, 2021 at 19:15
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for sharing the progress.

1. The Flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping this address?

I think this is an environment problem. Not every IP address can be tested with the "ping" command: ping uses ICMP, which a load balancer may drop even when the TCP port is open. I suggest you use "telnet 144.25.13.78 8081" to check the network connectivity.

2. Why is 144.25.13.78:8081 not accessible from outside, I mean from my laptop’s browser? I am on the company VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?

As in my answer above, I think you need to check the network connectivity via "telnet 144.25.13.78 8081". Maybe the firewall does not allow connections from your local machine (e.g. your local IP is not in the whitelist of the LoadBalancer).

3. In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can I automate this process and integrate it with CI/CD? Please share some blogs if there are any, thanks.

I think in a production environment you should have your own deployer, which takes care of submitting, listing, and cancelling the jobs. The deployer could even help with triggering savepoints and managing the whole lifecycle of Flink applications. I once developed a PoC of native-flink-k8s-operator [1]. It could be a starting point for your own deployer if you want to develop it in Java.

[1]. https://github.com/wangyang0918/flink-native-k8s-operator


Best,
Yang

Fuyao Li <fu...@oracle.com> wrote on Wed, Mar 31, 2021 at 6:37 AM:
Hello Yang,

Thank you so much for providing the flink-client.yaml. I was able to make some progress. I didn’t realize I should create a new flink-client pod to list/cancel jobs. I was trying to do this from my local laptop, which may be why it didn’t work. However, I still have several questions.

I created the deployment based on your flink-client.yaml
For the LoadBalancer mode:

I applied the ClusterRole YAML below.

# https://kubernetes.io/docs/reference/access-authn-authz/rbac/
# https://stackoverflow.com/questions/47973570/kubernetes-log-user-systemserviceaccountdefaultdefault-cannot-get-services
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: service-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Then I executed the command:
kubectl create clusterrolebinding service-reader-pod  --clusterrole=service-reader  --serviceaccount=default:default

After that, I am able to exec into the flink-client pod and list/cancel jobs.

$ kubectl exec -it flink-client-776886cf4f-9h47f bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@flink-client-776886cf4f-9h47f:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
2021-03-30 21:53:14,513 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster my-first-application-cluster successfully, JobManager Web Interface: http://144.25.13.78:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
24.03.2021 00:13:04 : eea39629a1931b67eb395207739455ce : Flink Streaming Java API Skeleton (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client-776886cf4f-9h47f:/opt/flink# ping 144.25.13.78
PING 144.25.13.78 (144.25.13.78) 56(84) bytes of data.

^C
--- 144.25.13.78 ping statistics ---
31 packets transmitted, 0 received, 100% packet loss, time 772ms

Questions:

  1.  The Flink client is able to list/cancel jobs. Based on the logs shared above, I should be able to ping 144.25.13.78, so why can I still NOT ping this address?
  2.  Why is 144.25.13.78:8081 not accessible from outside, I mean from my laptop’s browser? I am on the company VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?
  3.  In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can I automate this process and integrate it with CI/CD? Please share some blogs if there are any, thanks.


Best,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Monday, March 29, 2021 at 20:40
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for trying the native Kubernetes integration.

As you know, the Flink REST service can be exposed in the following three types, configured via "kubernetes.rest-service.exposed.type".

* ClusterIP, which means you can only access the Flink REST endpoint inside the K8s cluster. Users can simply start a Flink client in the K8s cluster via the following YAML file, then use "kubectl exec" to get into the pod and create a Flink session/application cluster. "flink list/cancel" also works well from there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-client
  template:
    metadata:
      labels:
        app: flink-client
    spec:
      containers:
      - name: client
        image: flink:1.12.2
        imagePullPolicy: Always
        args: ["sleep", "86400"]

* NodePort
Currently, we have a limitation that only the Kubernetes master nodes can be used to build the exposed Flink REST endpoint. So if your APIServer node does not run kube-proxy, the URL printed in the Flink client logs cannot be used. We already have a ticket [1] to support using one of the worker nodes for accessing the REST endpoint, but I have not managed to get it done myself.

* LoadBalancer
Is the resolved REST endpoint "http://144.25.13.78:8081/" accessible from your Flink client side? If yes, then I think the Flink client should be able to contact the JobManager REST server to list/cancel the jobs. I have verified this on Alibaba container service, and it works well.


[1]. https://issues.apache.org/jira/browse/FLINK-16601
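
For the ClusterIP and NodePort cases, one possible workaround that does not depend on the printed URL is to forward the REST service to the local machine with kubectl. This is only a sketch: it assumes the REST Service created by the native integration is named <cluster-id>-rest and listens on port 8081 (please verify with "kubectl get svc"), and the helper names are hypothetical.

```shell
# Name of the REST Service for a given cluster-id (assumed naming convention).
rest_service_name() {
  printf '%s-rest\n' "$1"
}

# Forward local port 8081 to the REST service; runs until interrupted.
# $1 = cluster-id, $2 = namespace (defaults to "default").
forward_rest() {
  local cluster_id="$1" namespace="${2:-default}"
  kubectl port-forward -n "$namespace" \
    "service/$(rest_service_name "$cluster_id")" 8081:8081
}

# Example:
# forward_rest my-first-application-cluster &
# curl -s http://localhost:8081/jobs/overview
```

While the port-forward is running, the dashboard and the REST API should be available on localhost:8081 from the machine that runs kubectl.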


Best,
Yang

Fuyao Li <fu...@oracle.com> wrote on Sat, Mar 27, 2021 at 5:59 AM:
Hi Community, Yang,

I am new to Flink on native Kubernetes and I am trying to do a POC for native Kubernetes application mode on Oracle Cloud Infrastructure. I was following the documentation here step by step: [1]

I am using Flink 1.12.1, Scala 2.11, Java 11.
I was able to create a native Kubernetes Deployment, but I am not able to use any further commands like list / cancel etc. I always run into a timeout error. I think the issue could be that the JobManager Web Interface IP address printed after job deployment is not accessible. This issue prevents me from shutting down the deployment with a savepoint. It could be a Kubernetes configuration issue. I have exposed all related port traffic and validated the security list, but still couldn’t make it work. Any help is appreciated.


The relevant Flink source code is the CliFrontend.java class [2].
The ./bin/flink list and cancel commands try to send traffic to the Flink dashboard UI IP address and time out. I tried both the LoadBalancer and NodePort options for the -Dkubernetes.rest-service.exposed.type configuration. Neither of them works.

# List running jobs on the cluster (I can’t execute this command successfully due to a timeout; logs shared below)
$ ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
# Cancel a running job (I can’t execute this command successfully)
$ ./bin/flink cancel --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster <jobId>

I think those commands need to communicate with the endpoint that is shown after the job submission command.


  1.  Use case 1 (deploy with NodePort)

# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=NodePort \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


When the exposed type is NodePort, the printed message says the Flink JobManager Web Interface is at http://192.29.104.156:30996. 192.29.104.156 is my Kubernetes apiserver address and 30996 is the port that exposes the service. However, the Flink dashboard is not reachable at this address.
I can only access the dashboard UI on each node’s IP address (there are three nodes in my K8S cluster):
100.104.154.73:30996
100.104.154.74:30996
100.104.154.75:30996
I got the following errors when trying to run the list command for this native Kubernetes deployment; see [4]. According to the documentation [3], this shouldn’t happen, since the Kubernetes apiserver address should also serve the Flink Web UI. Did I miss any Kubernetes configuration needed to make the Web UI available at the Kubernetes apiserver address?



  2.  Use case 2 (deploy with LoadBalancer)
# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=LoadBalancer \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


After a while, once the external IP is resolved, the output says the Flink JobManager web interface is at the external IP (the load balancer address): http://144.25.13.78:8081
When I execute the list command, I still get an error after waiting a long time for it to time out. See the errors here: [5]

I can still access NodeIP:<service-port>. In that case, I tend to believe it is a network issue, but I am still quite confused since I have already opened all the traffic.




Reference:
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[2] https://github.com/apache/flink/blob/f3155e6c0213de7bf4b58a89fb1e1331dee7701a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html#accessing-flinks-web-ui
[4] https://pastebin.ubuntu.com/p/WcJMwds52r/
[5] https://pastebin.ubuntu.com/p/m27BnQGXQc/


Thanks for your help in advance.

Best regards,
Fuyao



Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Yang Wang <da...@gmail.com>.
Hi Fuyao,

Thanks for sharing the progress.

> 1. The Flink client is able to list/cancel jobs. Based on the logs shared
> above, I should be able to ping 144.25.13.78, so why can I still NOT ping
> this address?


I think this is an environment problem. Not every IP address can be tested
with the "ping" command: ping uses ICMP, which a load balancer may drop even
when the TCP port is open. I suggest you use "telnet 144.25.13.78 8081" to
check the network connectivity.

> 2. Why is 144.25.13.78:8081 not accessible from outside, I mean from my
> laptop’s browser? I am on the company VPN, and such a public load
> balancer should expose the Flink Web UI, right? I tried to debug the
> network configuration but failed to find a reason. Could you give me some
> hints?


As in my answer above, I think you need to check the network connectivity
via "telnet 144.25.13.78 8081". Maybe the firewall does not allow
connections from your local machine (e.g. your local IP is not in the
whitelist of the LoadBalancer).

> 3. In production, what is the suggested approach to list and cancel jobs?
> The current manual work of “kubectl exec” into pods is not very reliable.
> How can I automate this process and integrate it with CI/CD? Please share
> some blogs if there are any, thanks.


I think in a production environment you should have your own deployer, which
takes care of submitting, listing, and cancelling the jobs. The deployer
could even help with triggering savepoints and managing the whole lifecycle
of Flink applications. I once developed a PoC of
native-flink-k8s-operator [1]. It could be a starting point for your own
deployer if you want to develop it in Java.

[1]. https://github.com/wangyang0918/flink-native-k8s-operator
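
To make the deployer idea a bit more concrete: such a tool would usually talk to the JobManager REST API directly instead of shelling into a pod. Below is a minimal sketch with curl, assuming the REST endpoint is reachable from wherever the script runs. FLINK_REST, the helper names, and the savepoint directory are placeholders for illustration and are not taken from the operator linked above.

```shell
# Base URL of the JobManager REST endpoint (placeholder; override as needed).
FLINK_REST="${FLINK_REST:-http://localhost:8081}"

# List all jobs with their current status.
list_jobs() {
  curl -sf "$FLINK_REST/jobs/overview"
}

# JSON body for stop-with-savepoint; $1 is the savepoint target directory.
# Note: the directory is inserted as-is, without JSON escaping.
stop_body() {
  printf '{"targetDirectory":"%s","drain":false}\n' "$1"
}

# Stop a job with a savepoint: $1 = job id, $2 = savepoint directory.
stop_with_savepoint() {
  curl -sf -X POST -H 'Content-Type: application/json' \
    -d "$(stop_body "$2")" "$FLINK_REST/jobs/$1/stop"
}

# Cancel a job without taking a savepoint: $1 = job id.
cancel_job() {
  curl -sf -X PATCH "$FLINK_REST/jobs/$1?mode=cancel"
}
```

For example, FLINK_REST=http://144.25.13.78:8081 list_jobs would query the load balancer address from this thread. Stopping with a savepoint is asynchronous on the Flink side, so a real deployer would also need to poll for completion, which this sketch leaves out.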


Best,
Yang

Re: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Fuyao Li <fu...@oracle.com>.
Hello Yang,

Thank you so much for providing the flink-client.yaml. I was able to make some progress. I didn’t realize I should create a new flink-client pod to list/cancel jobs. I was trying to do this from my local laptop, which may be why it didn’t work. However, I still have several questions.

I created the deployment based on your flink-client.yaml
For the LoadBalancer mode:

After applying the ClusterRole YAML below:

# https://kubernetes.io/docs/reference/access-authn-authz/rbac/
# https://stackoverflow.com/questions/47973570/kubernetes-log-user-systemserviceaccountdefaultdefault-cannot-get-services
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # no namespace here: ClusterRoles are cluster-scoped, not namespaced
  name: service-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

and running this command:
kubectl create clusterrolebinding service-reader-pod  --clusterrole=service-reader  --serviceaccount=default:default
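
Before exec'ing into the client pod, it may help to confirm that the binding actually grants the service account access to services. The following is a minimal sketch, assuming the default:default service account used above; the function name sa_can and the KUBECTL override variable are hypothetical helpers, not part of Flink or kubectl.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to print the command
# instead of running it. This variable is an assumption of this sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Ask the API server whether a service account may perform a verb on services.
# Arguments: verb, namespace (default "default"), account (default "default").
sa_can() {
  local verb="$1" namespace="${2:-default}" account="${3:-default}"
  "$KUBECTL" auth can-i "$verb" services \
    --as="system:serviceaccount:${namespace}:${account}"
}

# Example (not run automatically):
#   sa_can list     # kubectl answers "yes" or "no"
```

If the answer is "no" for get or list, the Flink client in the pod would probably fail while looking up the rest service, before it ever contacts the JobManager.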

I am able to exec into the flink-client pod and list/cancel jobs.

$ kubectl exec -it flink-client-776886cf4f-9h47f bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@flink-client-776886cf4f-9h47f:/opt/flink# ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
2021-03-30 21:53:14,513 INFO  org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve flink cluster my-first-application-cluster successfully, JobManager Web Interface: http://144.25.13.78:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
24.03.2021 00:13:04 : eea39629a1931b67eb395207739455ce : Flink Streaming Java API Skeleton (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
root@flink-client-776886cf4f-9h47f:/opt/flink# ping 144.25.13.78
PING 144.25.13.78 (144.25.13.78) 56(84) bytes of data.

^C
--- 144.25.13.78 ping statistics ---
31 packets transmitted, 0 received, 100% packet loss, time 772ms

Questions:

  1.  The Flink client is able to list/cancel jobs, so based on the logs shared above I would expect to be able to ping 144.25.13.78. Why can I still NOT ping that address?
  2.  Why is 144.25.13.78:8081 not accessible from outside, i.e. from my laptop’s browser? I am within the company’s VPN, and such a public load balancer should expose the Flink Web UI, right? I tried to debug the network configuration but failed to find a reason. Could you give me some hints?
  3.  In production, what is the suggested approach to list and cancel jobs? The current manual work of “kubectl exec” into pods is not very reliable. How can this process be automated and integrated into CI/CD? Please share some blogs if there are any, thanks.
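
For question 3, one possible direction (a sketch only, not a recommendation confirmed in this thread) is to call the JobManager REST API directly from a pod or CI runner that can reach the endpoint, since that is the same endpoint the CLI talks to. The sketch assumes the REST address printed in the logs above; REST_BASE, CURL, flink_list_jobs and flink_cancel_job are hypothetical names introduced for illustration.

```shell
# Assumption: REST_BASE is a JobManager REST endpoint reachable from where
# this runs (here the LoadBalancer address printed in the logs above).
REST_BASE="${REST_BASE:-http://144.25.13.78:8081}"
# CURL can be overridden (for example CURL=echo) to print the request only.
CURL="${CURL:-curl}"

# List jobs with their current state (GET /jobs/overview).
flink_list_jobs() {
  "$CURL" -sS --max-time 10 "${REST_BASE}/jobs/overview"
}

# Cancel one job by its ID (PATCH /jobs/<jobId>?mode=cancel).
flink_cancel_job() {
  "$CURL" -sS --max-time 10 -X PATCH "${REST_BASE}/jobs/$1?mode=cancel"
}

# Example (not run automatically):
#   flink_list_jobs
#   flink_cancel_job eea39629a1931b67eb395207739455ce
```

The --max-time option keeps a CI step from hanging when the endpoint is not reachable, which is the failure mode described earlier in this thread.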


Best,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Monday, March 29, 2021 at 20:40
To: Fuyao Li <fu...@oracle.com>
Cc: user <us...@flink.apache.org>
Subject: [External] : Re: Need help with executing Flink CLI for native Kubernetes deployment
Hi Fuyao,

Thanks for trying the native Kubernetes integration.

As you know, the Flink REST service can be exposed in the following three types, configured via "kubernetes.rest-service.exposed.type".

* ClusterIP, which means you can only access the Flink REST endpoint inside the K8s cluster. Users can simply start a Flink client in the
K8s cluster via the following yaml file, and use "kubectl exec" to tunnel into the pod to create a Flink session/application cluster.
"flink list/cancel" also work well from there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-client
  template:
    metadata:
      labels:
        app: flink-client
    spec:
      containers:
      - name: client
        image: flink:1.12.2
        imagePullPolicy: Always
        args: ["sleep", "86400"]

* NodePort
Currently, we have a limitation that only the Kubernetes master nodes can be used to build the exposed Flink REST endpoint. So if your
APIServer node does not run kube-proxy, the URL printed in the Flink client logs cannot be used. We already have a ticket[1] to
support using one of the worker nodes for accessing the REST endpoint, but I have not managed to get it done myself.

* LoadBalancer
Is the resolved REST endpoint "http://144.25.13.78:8081/" accessible on your Flink client side? If yes, then I think the Flink client
should be able to contact the JobManager REST server to list/cancel the jobs. I have verified this on Alibaba container service, and it works well.


[1]. https://issues.apache.org/jira/browse/FLINK-16601


Best,
Yang

Fuyao Li <fu...@oracle.com> wrote on Sat, Mar 27, 2021 at 5:59 AM:
Hi Community, Yang,

I am new to Flink on native Kubernetes and I am trying to do a POC for native Kubernetes application mode on Oracle Cloud Infrastructure. I was following the documentation here step by step: [1]

I am using Flink 1.12.1, Scala 2.11, Java 11.
I was able to create a native Kubernetes Deployment, but I am not able to use any further commands like list / cancel etc. I always run into a timeout error. I think the issue could be that the JobManager Web Interface IP address printed after job deployment is not accessible. This issue prevents me from shutting down the deployment with a savepoint. It could be a Kubernetes configuration issue. I have exposed all related port traffic and validated the security list, but still couldn’t make it work. Any help is appreciated.


The relevant Flink source code is the CliFrontend.java class [2].
The ./bin/flink list and cancel commands try to send traffic to the Flink dashboard UI IP address and time out. I tried both the LoadBalancer and NodePort options for the -Dkubernetes.rest-service.exposed.type configuration. Neither of them works.

# List running jobs on the cluster (I can’t execute this command successfully due to a timeout, logs shared below)
$ ./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster
# Cancel a running job (I can’t execute this command successfully)
$ ./bin/flink cancel --target kubernetes-application -Dkubernetes.cluster-id=my-first-application-cluster <jobId>

I think those commands need to communicate with the endpoint that is shown after the job submission command.


  1.  Use case 1 (deploy with NodePort)

# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=NodePort \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


When the exposed type is NodePort, the printed message says the Flink JobManager Web Interface is at http://192.29.104.156:30996. 192.29.104.156 is my Kubernetes apiserver address and 30996 is the port that exposes the service. However, the Flink dashboard is not accessible at this address.
I can only access the dashboard UI on each node IP address (there are three nodes in my K8s cluster):
100.104.154.73:30996
100.104.154.74:30996
100.104.154.75:30996
I got the following errors when trying to run the list command for such a native Kubernetes deployment; see [4]. According to the documentation here [3], this shouldn’t happen, since the Kubernetes apiserver address should also serve the Flink Web UI. Did I miss any configuration in Kubernetes to make the Web UI available at the Kubernetes apiserver address?



  2.  Use case 2 (deploy with LoadBalancer)
# fuyli @ fuyli-mac in ~/Development/flink-1.12.1 [17:59:00] C:127
$ ./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-first-application-cluster \
    -Dkubernetes.container.image=us-phoenix-1.ocir.io/idxglh0bz964/flink-demo:21.3.1 \
    -Dkubernetes.container.image.pull-policy=IfNotPresent \
    -Dkubernetes.container.image.pull-secrets=ocirsecret \
    -Dkubernetes.rest-service.exposed.type=LoadBalancer \
    -Dkubernetes.service-account=flink-service-account \
local:///opt/flink/usrlib/quickstart-0.1.jar


After a while, when the external IP was resolved, it said the Flink JobManager web interface is at the external IP (load balancer address): http://144.25.13.78:8081
When I execute the list command, I still get an error after waiting a long time for it to time out. See the errors here: [5]

I can still access NodeIP:<service-port>. In this case, I tend to believe it is a network issue, but I am still quite confused since I have already opened all the traffic.




References:
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[2] https://github.com/apache/flink/blob/f3155e6c0213de7bf4b58a89fb1e1331dee7701a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html#accessing-flinks-web-ui
[4] https://pastebin.ubuntu.com/p/WcJMwds52r/
[5] https://pastebin.ubuntu.com/p/m27BnQGXQc/


Thanks for your help in advance.

Best regards,
Fuyao



Re: Need help with executing Flink CLI for native Kubernetes deployment

Posted by Yang Wang <da...@gmail.com>.
Hi Fuyao,

Thanks for trying the native Kubernetes integration.

As you know, the Flink REST service can be exposed in the following
three types, configured via "kubernetes.rest-service.exposed.type".

* ClusterIP, which means you can only access the Flink REST endpoint
inside the K8s cluster. Users can simply start a Flink client in the
K8s cluster via the following yaml file, and use "kubectl exec" to tunnel
into the pod to create a Flink session/application cluster.
"flink list/cancel" also work well from there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink-client
  template:
    metadata:
      labels:
        app: flink-client
    spec:
      containers:
      - name: client
        image: flink:1.12.2
        imagePullPolicy: Always
        args: ["sleep", "86400"]

* NodePort
Currently, we have a limitation that only the Kubernetes master nodes can
be used to build the exposed Flink REST endpoint. So if your
APIServer node does not run kube-proxy, the URL printed in the
Flink client logs cannot be used. We already have a ticket[1] to
support using one of the worker nodes for accessing the REST endpoint,
but I have not managed to get it done myself.

* LoadBalancer
Is the resolved REST endpoint "http://144.25.13.78:8081/" accessible on
your Flink client side? If yes, then I think the Flink client
should be able to contact the JobManager REST server to list/cancel the
jobs. I have verified this on Alibaba container service, and it works well.
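
The ClusterIP and LoadBalancer cases above can be sketched as two small shell helpers: one runs the Flink CLI inside the flink-client Deployment, the other checks whether the REST endpoint accepts TCP connections from the machine where it runs. This is only a sketch under assumptions: the function names and the KUBECTL override are hypothetical, the Deployment is assumed to be named flink-client as in the yaml above, and the container's working directory is assumed to be the Flink home directory so that ./bin/flink resolves.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to print the command
# instead of running it. This variable is an assumption of this sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Run a Flink CLI command inside the flink-client Deployment.
# Uses the "kubectl exec ... -- COMMAND" form.
flink_in_client_pod() {
  "$KUBECTL" exec deploy/flink-client -- ./bin/flink "$@"
}

# Return 0 if a TCP connection to host:port succeeds within the timeout
# (seconds, default 5); return non-zero otherwise.
rest_reachable() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (not run automatically):
#   flink_in_client_pod list --target kubernetes-application \
#     -Dkubernetes.cluster-id=my-first-application-cluster
#   rest_reachable 144.25.13.78 8081 && echo "REST endpoint reachable"
```

A TCP check is used here rather than ping because a load balancer may not answer ICMP even when its TCP port is open, so a failed ping does not by itself show that the REST endpoint is unreachable.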


[1]. https://issues.apache.org/jira/browse/FLINK-16601


Best,
Yang
