Posted to user@flink.apache.org by Fuyao Li <fu...@oracle.com> on 2021/04/30 23:50:14 UTC

StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Hello Community, Yang,

I am trying to extend the Flink native Kubernetes operator by adding some new features based on the repo [1]. I wrote a method to implement the image update functionality. [2] I added the call

triggerImageUpdate(oldFlinkApp, flinkApp, effectiveConfig);

right after the existing call

triggerSavepoint(oldFlinkApp, flinkApp, effectiveConfig);

and wrote a function to handle the image change behavior. [2]

Solution 1:
I want to use the stopWithSavepoint() method to complete the task. However, it gets stuck and never completes. Even if I call get() on the CompletableFuture, it always times out and throws exceptions. See the solution 1 logs. [3]
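For reference, this is roughly what I am calling (a simplified sketch, not the exact operator code; how the ClusterClient is built from effectiveConfig is omitted and the names are placeholders):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.ClusterClient;

public class StopWithSavepointSketch {
    // The ClusterClient for the application cluster is assumed to be created
    // elsewhere from effectiveConfig (omitted here).
    static String stopJobWithSavepoint(
            ClusterClient<String> client, JobID jobId, String savepointDir) throws Exception {
        // stop-with-savepoint should drain the job and terminate it once the savepoint finishes
        CompletableFuture<String> future = client.stopWithSavepoint(jobId, false, savepointDir);
        // this get() is the call that keeps timing out for me inside the operator
        return future.get(2, TimeUnit.MINUTES);
    }
}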

Solution 2:
I tried triggering a savepoint, then deleting the deployment in code, and then creating a new application with the new image. This seems to work fine. Log link: [4]
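Roughly, the workaround looks like this (again a sketch with placeholder names, not the exact code in [2]):

import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.KubernetesClient;
import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.ClusterClient;

public class ImageUpdateWorkaroundSketch {
    static void savepointDeleteAndRedeploy(
            ClusterClient<String> flinkClient, KubernetesClient kubeClient, JobID jobId,
            String savepointDir, String namespace, String clusterId) throws Exception {
        // 1. Take a savepoint and wait for it to finish.
        String savepointPath =
                flinkClient.triggerSavepoint(jobId, savepointDir).get(2, TimeUnit.MINUTES);

        // 2. Delete the JobManager Deployment (native k8s names it after the cluster id).
        kubeClient.apps().deployments().inNamespace(namespace).withName(clusterId).delete();

        // 3. The controller then deploys the application again with the new image,
        //    setting execution.savepoint.path to savepointPath.
    }
}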

My questions:

  1.  Why does solution 1 get stuck? The triggerSavepoint() CompletableFuture works fine here… Why does stopWithSavepoint() always get stuck or time out? Very confused.
  2.  I am still new to the Fabric8io library; did I do anything wrong in the implementation? Maybe I should update the jobStatus? Please give me some suggestions.
  3.  For the workaround in solution 2, is there any bad side effect I didn’t notice?


[1] https://github.com/wangyang0918/flink-native-k8s-operator
[2] https://pastebin.ubuntu.com/p/tQShjmdcJt/
[3] https://pastebin.ubuntu.com/p/YHSPpK4W4Z/
[4] https://pastebin.ubuntu.com/p/3VG7TtXXfh/

Best,
Fuyao

Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello All,

I have solved the problem. In case someone needs a reference in the future, I will share my solution here.

Problem 1: Log issue
Flink 1.12 changed the default log4j configuration file name from log4j.properties to log4j-console.properties. From the operator's perspective, the operator pod's /opt/flink/conf directory must contain log4j-console.properties for logging to function properly. log4j.properties won't be recognized.
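In other words, the ConfigMap mounted into the operator pod just has to carry the new file name. A sketch of the relevant part of the operator Deployment (the container and ConfigMap names here are placeholders, not necessarily what is in my repo):

spec:
  containers:
    - name: flink-native-k8s-operator      # placeholder container name
      volumeMounts:
        - name: flink-config
          mountPath: /opt/flink/conf
  volumes:
    - name: flink-config
      configMap:
        name: flink-operator-config        # placeholder ConfigMap name
        items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties  # must be this file name for Flink 1.12+
            path: log4j-console.properties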

Problem 2:
stopWithSavepoint doesn't work, and the Flink CLI stop/cancel/savepoint commands don't work with native Kubernetes.

For the CLI commands, I need to add --target=kubernetes-application -Dkubernetes.cluster-id=<cluster-id> to all application mode Flink CLI commands to make them work. For the stop/cancel/savepoint commands, I was directly following the doc here [1] without adding those configuration parameters. The Flink documentation doesn't point this out explicitly and I was confused earlier.
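For example, the stop command that now works for me looks like this (cluster id and job id are placeholders):

./bin/flink stop --target kubernetes-application -Dkubernetes.cluster-id=<cluster-id> <job-id>

The cancel and savepoint commands need the same two options added in the same way.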

Maybe the doc can add a note here to be more informative?

As for the stop command not working from my code, it was because my Kafka client version was too old. I was using kafka-clients 1.1.0 (a very old version), which otherwise works okay with my Flink application. Because of the log issue, I didn't manage to notice this error earlier. I actually got the following error while executing the stop command.

java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'void org.apache.kafka.clients.producer.KafkaProducer.close(java.time.Duration)'

The stop command introduces better semantics for restarting a job, and it ends up calling this method in the Flink application. An old Kafka client version will run into this failure; the cancel command does not have this issue. I didn't look deep into the source code implementation for this; maybe you can share more insights.
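In case it helps anyone else, the fix on my side was just bumping the client; KafkaProducer.close(java.time.Duration) only exists from kafka-clients 2.0 onwards. The version below is just an example, pick whatever matches your Flink Kafka connector:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.4.1</version>
</dependency>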

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/cli/

Thanks,
Fuyao

From: Yang Wang <da...@gmail.com>
Date: Friday, May 7, 2021 at 20:45
To: Fuyao Li <fu...@oracle.com>
Cc: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>, user <us...@flink.apache.org>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Since your problem is about the flink-native-k8s-operator, let's move the discussion there.

Best,
Yang

Fuyao Li <fu...@oracle.com> wrote on Saturday, May 8, 2021 at 5:41 AM:
Hi Austin, Yang, Matthias,

I am following up to see if you guys have any idea regarding this problem.

I also created an issue here to describe the problem. [1]

After looking into the source code [2], I believe that for native k8s, three configuration files should be added to the flink-config-<cluster-id> configmap automatically. However, only flink-conf.yaml is present in the operator-created Flink application, and that is also causing the start command difference mentioned in the issue.


Native k8s using Flink CLI: Args:
      native-k8s
      $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 -Dlog.file=/opt/flink/log/jobmanager.log -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b


Operator flink app:
    Args:
      native-k8s
      $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b

Please share your opinion on this. Thanks!

[1] https://github.com/wangyang0918/flink-native-k8s-operator/issues/4
[2] https://github.com/apache/flink/blob/release-1.12/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/FlinkConfMountDecorator.java

Have a good weekend!
Best,
Fuyao


From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 19:52
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>, Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello All,

I also checked the native-k8s's automatically generated configmap. It only contains the flink-conf.yaml, but no log4j.properties. I am not very familiar with the implementation details behind native k8s.

That should be the root cause. Could you check the implementation and help me locate the potential problem?
Yang's initial code: https://github.com/wangyang0918/flink-native-k8s-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java
My modified version: https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java

Thank you so much.

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 19:34
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>, Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Austin, Yang,

For the logging issue, I think I have found something worth noting.

They are all based on base image flink:1.12.1-scala_2.11-java11

Dockerfile: https://pastebin.ubuntu.com/p/JTsHygsTP6/

In the JM and TM provisioned by the k8s operator, there is only flink-conf.yaml in the $FLINK_HOME/conf directory. Even if I add these configuration files to the image in advance, the operator seems to override them and remove all other log4j configurations. This is why the logs can't be printed correctly.
root@flink-demo-5fc78c8cf-hgvcj:/opt/flink/conf# ls
flink-conf.yaml


However, for the pods provisioned by the Flink native k8s CLI, the log4j related configurations do exist.

root@test-application-cluster-79c7f9dcf7-44bq8:/opt/flink/conf# ls
flink-conf.yaml  log4j-console.properties  logback-console.xml


The native Kubernetes operator pod can print logs correctly because it has the log4j.properties file mounted into the /opt/flink/conf/ directory. [1]
For the Flink pods, it seems that only a flink-conf.yaml is injected there. [2][3] No log4j related configmap is configured, which makes the logs in those pods unavailable.

I am not sure how to inject something similar into the Flink pods. Maybe add a structure similar to the one in [1] into the cr.yaml, so that such a configmap makes log4j.properties available to the Flink CRD?

I am kind of confused about how to implement this. The deployment is a one-step operation in [4], and I don't know how to make a configmap available to it. Maybe I can only use the new pod template feature in Flink 1.13 to do this?
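Just to illustrate the pod template mechanism I mean (a sketch assuming Flink 1.13's kubernetes.pod-template-file option and a pre-created ConfigMap named flink-log-config; I have not verified that this actually fixes the logging problem):

apiVersion: v1
kind: Pod
metadata:
  name: jobmanager-pod-template
spec:
  containers:
    - name: flink-main-container             # the main container must use this name in pod templates
      volumeMounts:
        - name: log-config
          mountPath: /opt/flink/conf-extra   # separate path; /opt/flink/conf is managed by native k8s
  volumes:
    - name: log-config
      configMap:
        name: flink-log-config               # hypothetical ConfigMap carrying log4j-console.properties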



[1] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/flink-native-k8s-operator.yaml#L58
[2] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/cr.yaml#L21
[3] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/Utils/FlinkUtils.java#L83
[4] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java#L176

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 15:23
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>
Cc: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>, Austin Cawley-Edwards <au...@ververica.com>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello All,

For Logging issue:
Hi Austin,

This is my full code implementation link [1]; I just removed the credential related things. The operator definition can be found here. [2] You can also check other parts if you find any problem.
The operator uses log4j2, and you can see there is a log appender for the operator [3]; the operator pod can indeed print out logs. But for the Flink application JM and TM pods, I see the errors mentioned earlier: the sed errors and "ERROR StatusLogger No Log4j 2 configuration file found."

I used to use log4j for the Flink application; to avoid potential incompatibility issues, I have already upgraded the POM of the Flink application to use log4j2. But the logging problem still exists.

This is my log4j2.properties file in the Flink application. [6] These are the logging related pom dependencies for the Flink application. [7]

The logs can be printed during normal native k8s deployment and IDE debugging. When it comes to the operator, it doesn't seem to work. Could this be caused by a classpath conflict, since I introduced the presto jar into the Flink distribution? This is my Dockerfile to build the Flink application image. [5]

Please share your idea on this.

For the stopWithSavepoint issue,
Just to note, I understand the cancel command (cancelWithSavepoint()) is a deprecated feature and may not guarantee exactly-once semantics, possibly giving inconsistent results for things like timers? Please correct me if I am wrong. The code that works with cancelWithSavepoint() is shared in [4] below.


[1] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator
[2] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/flink-native-k8s-operator.yaml
[3] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/resources/log4j2.properties
[4] https://pastebin.ubuntu.com/p/tcxT2FwPRS/
[5] https://pastebin.ubuntu.com/p/JTsHygsTP6/
[6] https://pastebin.ubuntu.com/p/2wgdcxVfSy/
[7] https://pastebin.ubuntu.com/p/Sq8xRjQyVY/


Best,
Fuyao

From: Austin Cawley-Edwards <au...@gmail.com>
Date: Tuesday, May 4, 2021 at 14:47
To: matthias@ververica.com <ma...@ververica.com>
Cc: Fuyao Li <fu...@oracle.com>, user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>, Austin Cawley-Edwards <au...@ververica.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hey all,

Thanks for the ping, Matthias. I'm not super familiar with the details of @Yang Wang's operator, to be honest :(. Can you share some of your FlinkApplication specs?

For the `kubectl logs` command, I believe that just reads stdout from the container. Which logging framework are you using in your application and how have you configured it? There's a good guide for configuring the popular ones in the Flink docs[1]. For instance, if you're using the default Log4j 2 framework you should configure a ConsoleAppender[2].
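A minimal console setup in the Log4j 2 properties format looks roughly like Flink's bundled log4j-console.properties (a sketch only, trimmed down to the console appender):

rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender

appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n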

Hope that helps a bit,
Austin

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/advanced/logging/
[2]: https://logging.apache.org/log4j/2.x/manual/appenders.html#ConsoleAppender

On Tue, May 4, 2021 at 1:59 AM Matthias Pohl <ma...@ververica.com> wrote:
Hi Fuyao,
sorry for not replying earlier. The stop-with-savepoint operation shouldn't only suspend but terminate the job. Is it that you might have a larger state that makes creating the savepoint take longer? Even though, considering that you don't experience this behavior with your 2nd solution, I'd assume that we could ignore this possibility.

I'm gonna add Austin to the conversation as he worked with k8s operators as well already. Maybe, he can also give you more insights on the logging issue which would enable us to dig deeper into what's going on with stop-with-savepoint.

Best,
Matthias

On Tue, May 4, 2021 at 4:33 AM Fuyao Li <fu...@oracle.com> wrote:
Hello,

Update:
I think stopWithSavepoint() only suspends the job. It doesn't actually terminate (./bin/flink cancel) the job. I switched to cancelWithSavepoint() and it works here.

Maybe stopWithSavepoint() should only be used to update configurations like parallelism? For updating the image, it seems unsuitable; please correct me if I am wrong.
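The call that does return for me is roughly the following (a sketch, with the same caveats as the earlier snippets about how the client and job id are obtained):

// cancel-with-savepoint: deprecated, but it completes where stopWithSavepoint hangs for me
String savepointPath =
        client.cancelWithSavepoint(jobId, savepointDir).get(2, java.util.concurrent.TimeUnit.MINUTES);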

For the log issue, I am still a bit confused. Why is it not available in kubectl logs? How should I get access to it?

Thanks.
Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Sunday, May 2, 2021 at 00:36
To: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello,

I noticed that first triggering a savepoint and then deleting the deployment might cause duplicate data. That could hurt semantic correctness. Please give me some hints on how to make stopWithSavepoint() work correctly with the Fabric8io Java k8s client to perform this image update operation. Thanks!

Best,
Fuyao



From: Fuyao Li <fu...@oracle.com>
Date: Friday, April 30, 2021 at 18:03
To: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Community, Yang,

I have one more question about logging. I noticed that if I execute the kubectl logs command against the JM, the pods provisioned by the operator can't print out the internal Flink logs. I only get something like the logs below; no actual Flink logs are printed here… Where can I find the path to the logs? Maybe use a sidecar container to get them out? How can I get the logs without checking the Flink WebUI? Also, the sed errors confuse me. In fact, the application is already up and running correctly if I access the WebUI through Ingress.

Reference: https://github.com/wangyang0918/flink-native-k8s-operator/issues/4


[root@bastion deploy]# kubectl logs -f flink-demo-594946fd7b-822xk

sed: couldn't open temporary file /opt/flink/conf/sedh1M3oO: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sed8TqlNR: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedvO2DFU: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release


-------- The logs stop here; the Flink application logs don't get printed anymore ---------

^C
[root@bastion deploy]# kubectl logs -f flink-demo-taskmanager-1-1
sed: couldn't open temporary file /opt/flink/conf/sedaNDoNR: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/seddze7tQ: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedYveZoT: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx697932173 -Xms697932173 -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=166429984b -D taskmanager.memory.network.min=166429984b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=665719939b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=563714445b -D taskmanager.memory.task.off-heap.size=0b --configDir /opt/flink/conf -Djobmanager.memory.jvm-overhead.min='429496736b' -Dpipeline.classpaths='file:usrlib/quickstart-0.1.jar' -Dtaskmanager.resource-id='flink-demo-taskmanager-1-1' -Djobmanager.memory.off-heap.size='134217728b' -Dexecution.target='embedded' -Dweb.tmpdir='/tmp/flink-web-d7691661-fac5-494e-8154-896b4fe30692' -Dpipeline.jars='file:/opt/flink/usrlib/quickstart-0.1.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='3462817376b' -Djobmanager.memory.jvm-overhead.max='429496736b'
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Apr 29, 2021 12:58:34 AM oracle.simplefan.impl.FanManager configure
SEVERE: attempt to configure ONS in FanManager failed with oracle.ons.NoServersAvailable: Subscription time out


-------- The logs stop here; the Flink application logs don't get printed anymore ---------


Best,
Fuyao



Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Yang Wang <da...@gmail.com>.
Since your problem is about the flink-native-k8s-operator, let's move the
discussion there.

Best,
Yang


Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hi Austin, Yang, Matthias,

I am following up to see if you guys have any idea regarding this problem.

I also created an issue here to describe the problem. [1]

After looking into the source code[1], I believe for native k8s, three configuration files should be added to the flink-config-<cluster-id> configmap automatically. However, it just have the flink-conf.yaml in the operator created flink application. And that is also causing the start command difference mentioned in the issue.


Native k8s using Flink CLI: Args:
      native-k8s
      $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx1073741824 -Xms1073741824 -XX:MaxMetaspaceSize=268435456 -Dlog.file=/opt/flink/log/jobmanager.log -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=201326592b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=1073741824b -D jobmanager.memory.jvm-overhead.max=201326592b


Operator flink app:
    Args:
      native-k8s
      $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b

Please share your opinion on this. Thanks!

[1] https://github.com/wangyang0918/flink-native-k8s-operator/issues/4
[2] https://github.com/apache/flink/blob/release-1.12/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/FlinkConfMountDecorator.java

Have a good weekend!
Best,
Fuyao


From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 19:52
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>, Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello All,

I also checked the native-k8s’s automatically generated configmap. It only contains the flink-conf.yaml, but no log4j.properties. I am not very familiar with the implementation details behind native k8s.

That should be the root cause, could you check the implementation and help me to locate the potential problem.
Yang’s initial code: https://github.com/wangyang0918/flink-native-k8s-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java<https://urldefense.com/v3/__https:/github.com/wangyang0918/flink-native-k8s-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java__;!!GqivPVa7Brio!K_YcBE0y2rd7zQtMtIKmSNNUpMmUTUSA8VdkNCJ8i9w2tH2nwNWoq3j7UZwZEh4$>
My modified version: https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java<https://urldefense.com/v3/__https:/github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java__;!!GqivPVa7Brio!K_YcBE0y2rd7zQtMtIKmSNNUpMmUTUSA8VdkNCJ8i9w2tH2nwNWoq3j78oKszhY$>

Thank you so much.

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 19:34
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>, Yang Wang <da...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Austin, Yang,

For the logging issue, I think I have found something worth noticing.

They are all based on the base image flink:1.12.1-scala_2.11-java11.

Dockerfile: https://pastebin.ubuntu.com/p/JTsHygsTP6/

In the JM and TM pods provisioned by the k8s operator, there is only flink-conf.yaml in the $FLINK_HOME/conf directory. Even if I add these configuration files to the image in advance, the operator seems to override the directory and remove all other log4j configurations. This is why the logs can’t be printed correctly.
root@flink-demo-5fc78c8cf-hgvcj:/opt/flink/conf# ls
flink-conf.yaml


However, the pods provisioned by the Flink native k8s CLI do contain the log4j-related configuration files.

root@test-application-cluster-79c7f9dcf7-44bq8:/opt/flink/conf# ls
flink-conf.yaml  log4j-console.properties  logback-console.xml


The native Kubernetes operator pod can print logs correctly because it has the log4j.properties file mounted to the /opt/flink/conf/ directory. [1]
For the Flink pods, it seems that only a flink-conf.yaml is injected there. [2][3] No log4j-related ConfigMap entry is configured, which makes the logs in those pods unavailable.

I am not sure how to inject something similar into the Flink pods. Maybe I could add a structure similar to the one in [1] into the cr.yaml, so that such a ConfigMap would make the log4j.properties available to the Flink CRD?

I am kind of confused about how to implement this. The deployment is a one-step operation in [4], and I don’t know how to make a ConfigMap available to it. Maybe I can only use the new pod-template feature in Flink 1.13 to do this? A rough sketch of one possible workaround follows the references below.



[1] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/flink-native-k8s-operator.yaml#L58
[2] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/cr.yaml#L21
[3] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/Utils/FlinkUtils.java#L83
[4] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkApplicationController.java#L176
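
As mentioned above, here is a rough, untested sketch of one possible workaround: use the fabric8 client to add a minimal log4j-console.properties entry to the generated flink-config-<cluster-id> ConfigMap after the deployment is created. The namespace, cluster id, and appender settings are placeholders, and since the JM start command may also need the log4j JVM options (-Dlog4j.configurationFile=...), patching the ConfigMap alone may not be sufficient; it only illustrates how an entry could be added programmatically:

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PatchFlinkLoggingConfig {

    // A minimal Log4j 2 console configuration (adjust level/pattern as needed).
    private static final String LOG4J_CONSOLE = String.join("\n",
            "rootLogger.level = INFO",
            "rootLogger.appenderRef.console.ref = ConsoleAppender",
            "appender.console.name = ConsoleAppender",
            "appender.console.type = CONSOLE",
            "appender.console.layout.type = PatternLayout",
            "appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n");

    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            ConfigMap existing = client.configMaps()
                    .inNamespace("default")
                    .withName("flink-config-flink-demo")
                    .get();
            if (existing == null) {
                throw new IllegalStateException("ConfigMap flink-config-flink-demo not found");
            }
            // Rebuild the ConfigMap with the extra data entry and replace it.
            ConfigMap patched = new ConfigMapBuilder(existing)
                    .addToData("log4j-console.properties", LOG4J_CONSOLE)
                    .build();
            client.configMaps().inNamespace("default").createOrReplace(patched);
        }
    }
}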

Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Tuesday, May 4, 2021 at 15:23
To: Austin Cawley-Edwards <au...@gmail.com>, matthias@ververica.com <ma...@ververica.com>
Cc: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>, Austin Cawley-Edwards <au...@ververica.com>
Subject: Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello All,

For Logging issue:
Hi Austin,

This is my full code implementation link [1]; I just removed the credential-related things. The operator definition can be found here [2]. You can also check other parts if you find any problem.
The operator uses log4j2, and you can see there is a log appender for the operator [3]; the operator pod can indeed print out logs. But for the Flink application JM and TM pods, I can see the errors mentioned earlier: the sed errors and "ERROR StatusLogger No Log4j 2 configuration file found".

I used to use log4j for the Flink application; to avoid potential incompatibility issues, I have already upgraded the POM of the Flink application to use log4j2. But the logging problem still exists.

This is my log4j2.properties file in the Flink application [6], and these are the logging-related POM dependencies of the Flink application [7].

The logs are printed during normal native k8s deployment and during IDE debugging. When it comes to the operator, it doesn’t seem to work. Could this be caused by a class namespace (classpath) conflict, since I introduced the Presto jar into the Flink distribution? This is my Dockerfile to build the Flink application jar [5].

Please share your idea on this.

For the stopWithSavepoint issue,
Just to note, I understand the cancel command (cancelWithSavepoint()) is a deprecated feature and may not guarantee exactly-once semantics, possibly giving inconsistent results for things like timers? Please correct me if I am wrong. The code that works with cancelWithSavepoint() is shared in [4] below.
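
For reference, this is a minimal sketch (not the operator’s actual code; it assumes a ClusterClient<String> obtained for the native Kubernetes deployment, with placeholder job id, savepoint path, and timeout) of how the two calls are typically driven, bounding the wait on the returned futures instead of blocking forever:

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.ClusterClient;

public class SavepointCalls {

    // stop-with-savepoint: drains the sources and terminates the job once the savepoint succeeds.
    static String stopWithSavepoint(ClusterClient<String> client, JobID jobId) throws Exception {
        return client
                .stopWithSavepoint(jobId, false, "s3://my-bucket/savepoints") // placeholder path (null uses the default dir)
                .get(60, TimeUnit.SECONDS);                                   // bound the wait
    }

    // cancel-with-savepoint: takes a savepoint, then cancels the job (the older, deprecated path).
    static String cancelWithSavepoint(ClusterClient<String> client, JobID jobId) throws Exception {
        return client
                .cancelWithSavepoint(jobId, "s3://my-bucket/savepoints")
                .get(60, TimeUnit.SECONDS);
    }
}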


[1] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator
[2] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/flink-native-k8s-operator.yaml
[3] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/resources/log4j2.properties
[4] https://pastebin.ubuntu.com/p/tcxT2FwPRS/
[5] https://pastebin.ubuntu.com/p/JTsHygsTP6/
[6] https://pastebin.ubuntu.com/p/2wgdcxVfSy/
[7] https://pastebin.ubuntu.com/p/Sq8xRjQyVY/


Best,
Fuyao

From: Austin Cawley-Edwards <au...@gmail.com>
Date: Tuesday, May 4, 2021 at 14:47
To: matthias@ververica.com <ma...@ververica.com>
Cc: Fuyao Li <fu...@oracle.com>, user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>, Austin Cawley-Edwards <au...@ververica.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hey all,

Thanks for the ping, Matthias. I'm not super familiar with the details of @Yang Wang<ma...@gmail.com>'s operator, to be honest :(. Can you share some of your FlinkApplication specs?

For the `kubectl logs` command, I believe that just reads stdout from the container. Which logging framework are you using in your application and how have you configured it? There's a good guide for configuring the popular ones in the Flink docs[1]. For instance, if you're using the default Log4j 2 framework you should configure a ConsoleAppender[2].

Hope that helps a bit,
Austin

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/advanced/logging/
[2]: https://logging.apache.org/log4j/2.x/manual/appenders.html#ConsoleAppender

On Tue, May 4, 2021 at 1:59 AM Matthias Pohl <ma...@ververica.com>> wrote:
Hi Fuyao,
sorry for not replying earlier. The stop-with-savepoint operation shouldn't only suspend the job but terminate it. Could it be that you have a larger state that makes creating the savepoint take longer? Then again, considering that you don't experience this behavior with your 2nd solution, I'd assume that we can ignore this possibility.

I'm gonna add Austin to the conversation, as he has already worked with k8s operators as well. Maybe he can also give you more insights on the logging issue, which would enable us to dig deeper into what's going on with stop-with-savepoint.

Best,
Matthias

On Tue, May 4, 2021 at 4:33 AM Fuyao Li <fu...@oracle.com>> wrote:
Hello,

Update:
I think stopWithSavepoint() only suspends the job; it doesn’t actually terminate (./bin/flink cancel) the job. I switched to cancelWithSavepoint() and it works here.

Maybe stopWithSavepoint() should only be used to update configurations like parallelism? For updating the image, it seems unsuitable; please correct me if I am wrong.

For the log issue, I am still a bit confused. Why is it not available in kubectl logs? How should I get access to it?

Thanks.
Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>>
Date: Sunday, May 2, 2021 at 00:36
To: user <us...@flink.apache.org>>, Yang Wang <da...@gmail.com>>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello,

I noticed that first triggering a savepoint and then deleting the deployment might cause duplicate data, which could hurt semantic correctness. Please give me some hints on how to make stopWithSavepoint() work correctly with the Fabric8io Java k8s client to perform this image update operation. Thanks!

Best,
Fuyao



From: Fuyao Li <fu...@oracle.com>>
Date: Friday, April 30, 2021 at 18:03
To: user <us...@flink.apache.org>>, Yang Wang <da...@gmail.com>>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Community, Yang,

I have one more question about logging. I also noticed that if I execute the kubectl logs command against the JM, the pods provisioned by the operator don’t print the internal Flink logs; I only get something like the output below, and no actual Flink logs are printed here… Where can I find the path to the logs? Maybe use a sidecar container to get them out? How can I get the logs without checking the Flink WebUI? Also, the sed errors confuse me here. In fact, the application is already up and running correctly when I access the WebUI through Ingress.

Reference: https://github.com/wangyang0918/flink-native-k8s-operator/issues/4


[root@bastion deploy]# kubectl logs -f flink-demo-594946fd7b-822xk

sed: couldn't open temporary file /opt/flink/conf/sedh1M3oO: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sed8TqlNR: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedvO2DFU: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release


-------- The logs stop here; Flink application logs don’t get printed anymore ---------

^C
[root@bastion deploy]# kubectl logs -f flink-demo-taskmanager-1-1
sed: couldn't open temporary file /opt/flink/conf/sedaNDoNR: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/seddze7tQ: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedYveZoT: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx697932173 -Xms697932173 -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=166429984b -D taskmanager.memory.network.min=166429984b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=665719939b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=563714445b -D taskmanager.memory.task.off-heap.size=0b --configDir /opt/flink/conf -Djobmanager.memory.jvm-overhead.min='429496736b' -Dpipeline.classpaths='file:usrlib/quickstart-0.1.jar' -Dtaskmanager.resource-id='flink-demo-taskmanager-1-1' -Djobmanager.memory.off-heap.size='134217728b' -Dexecution.target='embedded' -Dweb.tmpdir='/tmp/flink-web-d7691661-fac5-494e-8154-896b4fe30692' -Dpipeline.jars='file:/opt/flink/usrlib/quickstart-0.1.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='3462817376b' -Djobmanager.memory.jvm-overhead.max='429496736b'
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Apr 29, 2021 12:58:34 AM oracle.simplefan.impl.FanManager configure
SEVERE: attempt to configure ONS in FanManager failed with oracle.ons.NoServersAvailable: Subscription time out


-------- The logs stop here; Flink application logs don’t get printed anymore ---------


Best,
Fuyao


From: Fuyao Li <fu...@oracle.com>>
Date: Friday, April 30, 2021 at 16:50
To: user <us...@flink.apache.org>>, Yang Wang <da...@gmail.com>>
Subject: [External] : StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Community, Yang,

I am trying to extend the flink native Kubernetes operator by adding some new features based on the repo [1]. I wrote a method to release the image update functionality. [2] I added the
triggerImageUpdate(oldFlinkApp, flinkApp, effectiveConfig);

under the existing method.

triggerSavepoint(oldFlinkApp, flinkApp, effectiveConfig);


I wrote a function to accommodate the image change behavior.[2]

Solution1:
I want to use stopWithSavepoint() method to complete the task. However, I found it will get stuck and never get completed. Even if I use get() for the completeableFuture. It will always timeout and throw exceptions. See solution 1 logs [3]

Solution2:
I tried to trigger a savepoint, then delete the deployment in the code and then create a new application with new image. This seems to work fine. Log link: [4]

My questions:

  1.  Why solution 1 will get stuck? triggerSavepoint() CompleteableFuture could work here… Why stopWithSavepoint() will always get stuck or timeout? Very confused.
  2.  For Fabric8io library, I am still new to it, did I do anything wrong in the implementation, maybe I should update the jobStatus? Please give me some suggestions.
  3.  For work around solution 2, is there any bad influence I didn’t notice?


[1] https://github.com/wangyang0918/flink-native-k8s-operator
[2] https://pastebin.ubuntu.com/p/tQShjmdcJt/
[3] https://pastebin.ubuntu.com/p/YHSPpK4W4Z/
[4] https://pastebin.ubuntu.com/p/3VG7TtXXfh/

Best,
Fuyao


Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello All,

For the logging issue:
Hi Austin,

This is my full code implementation [1]; I just removed the credential-related things. The operator definition can be found here [2]. Feel free to check the other parts as well if you notice any problem.
The operator uses Log4j 2 and there is a log appender configured for the operator [3], and the operator pod does print logs. But for the Flink application JM and TM pods, I see the errors mentioned earlier: the sed errors and the "ERROR StatusLogger No Log4j 2 configuration file found" message.

I used to use Log4j 1 in the Flink application; to avoid potential incompatibility issues, I have already upgraded the application POM to use Log4j 2. But the logging problem still exists.

This is the log4j2.properties file in my Flink application [6], and these are the logging-related POM dependencies of the Flink application [7].

The logs are printed correctly during a normal native K8s deployment and during IDE debugging; only when the application is deployed by the operator does logging stop working. Could this be caused by a classpath conflict, since I introduced the Presto jar into the Flink distribution? This is the Dockerfile I use to package the Flink application jar into the image [5].

Please share your thoughts on this.

For the stopWithSavepoint() issue:
Just to note, I understand the cancel command (cancelWithSavepoint()) is deprecated and may not guarantee exactly-once semantics, so it could produce inconsistent results, e.g. around timers. Please correct me if I am wrong. The code that works with cancelWithSavepoint() is shared in [4] below.
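
For reference, the call in [4] boils down to roughly the sketch below (illustrative only, not the exact code in [4]; restClusterClient, jobId and targetDirectory come from the operator’s existing reconcile logic, and targetDirectory may be null to fall back to state.savepoints.dir):

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.rest.RestClusterClient;
import java.util.concurrent.TimeUnit;

// Sketch: cancel the job with a savepoint and bound the wait so the
// operator's reconcile loop cannot hang forever on a stuck future.
static String cancelJobWithSavepoint(RestClusterClient<String> restClusterClient,
                                     JobID jobId,
                                     String targetDirectory) throws Exception {
    return restClusterClient
            .cancelWithSavepoint(jobId, targetDirectory)
            .get(60, TimeUnit.SECONDS);
}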


[1] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator
[2] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/deploy/flink-native-k8s-operator.yaml
[3] https://github.com/FuyaoLi2017/flink-native-kubernetes-operator/blob/master/src/main/resources/log4j2.properties
[4] https://pastebin.ubuntu.com/p/tcxT2FwPRS/
[5] https://pastebin.ubuntu.com/p/JTsHygsTP6/
[6] https://pastebin.ubuntu.com/p/2wgdcxVfSy/
[7] https://pastebin.ubuntu.com/p/Sq8xRjQyVY/


Best,
Fuyao


Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Austin Cawley-Edwards <au...@gmail.com>.
Hey all,

Thanks for the ping, Matthias. I'm not super familiar with the details of @Yang
Wang <da...@gmail.com>'s operator, to be honest :(. Can you share
some of your FlinkApplication specs?

For the `kubectl logs` command, I believe that just reads stdout from the
container. Which logging framework are you using in your application and
how have you configured it? There's a good guide for configuring the
popular ones in the Flink docs[1]. For instance, if you're using the
default Log4j 2 framework you should configure a ConsoleAppender[2].
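
As a rough example (not the exact file Flink ships with -- adjust the levels
and pattern to your needs), a minimal console-only Log4j 2 properties config
could look something like this:

# route everything to stdout so `kubectl logs` can pick it up
rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender

appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n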

Hope that helps a bit,
Austin

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/advanced/logging/
[2]:
https://logging.apache.org/log4j/2.x/manual/appenders.html#ConsoleAppender


Re: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello Matthias, Austin, Yang,

Just to offer a bit more information: this is a demo application and there was actually no real data sent into the system while I was testing this behavior, so the Flink state is very small.
I also tried to trigger the stop-with-savepoint REST API from inside the JM pod (the k8s operator uses a ClusterIP service by default, so I can’t access the REST endpoint directly; the WebUI in the screenshots below is exposed through an Ingress). The command I used is:

root@flink-demo-594946fd7b-l4qn5:/opt/flink# curl -H "Content-type: application/json" -d '{
>     "drain" : false,
>     "targetDirectory" : null
> }' 'localhost:8081/jobs/869cc58f996b2c35137f2e874b692537/stop'
{"request-id":"cc36976826a625406e5ead2425317527"}
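
Side note: if I read the REST docs correctly, the returned request-id can be used to poll the status of this stop operation through the savepoint status endpoint, e.g. (sketch, reusing the ids from above; please correct me if this is not the right endpoint):

curl 'localhost:8081/jobs/869cc58f996b2c35137f2e874b692537/savepoints/cc36976826a625406e5ead2425317527'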

As you can see, I received a request id. After some time (> 5 mins), the pod and the deployment are neither destroyed nor shut down. The job is still shown as RUNNING, but NO new checkpoints are triggered. That leaves me with the impression that the job is suspended but not terminated. See the screenshots below; the pod is still running (that’s why I can still reach this UI).
[Screenshot 1: Flink Web UI, job still shown as RUNNING]

[Screenshot 2: Flink Web UI, table view]


According to your statement, this is not the expected behavior… Any ideas to fix this?

For the logs, I found a related Flink thread in Chinese [2] and referenced the whole problem here [1]. This repo is the starting point of my k8s operator.

[1] https://github.com/wangyang0918/flink-native-k8s-operator/issues/4
[2] http://apache-flink.147419.n8.nabble.com/flink-1-11-on-kubernetes-tt4586.html#none

Thanks,

Best,
Fuyao





Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Matthias Pohl <ma...@ververica.com>.
Hi Fuyao,
sorry for not replying earlier. The stop-with-savepoint operation shouldn't
only suspend but terminate the job. Is it that you might have a larger
state that makes creating the savepoint take longer? Even though,
considering that you don't experience this behavior with your 2nd solution,
I'd assume that we could ignore this possibility.

I'm gonna add Austin to the conversation as he worked with k8s operators as
well already. Maybe, he can also give you more insights on the logging
issue which would enable us to dig deeper into what's going on with
stop-with-savepoint.

Best,
Matthias


Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello,

Update:
I think stopWithSavepoint() only suspends the job; it doesn’t actually terminate it (the way ./bin/flink cancel does). I switched to cancelWithSavepoint() and it works here.
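
Concretely, this is roughly what I observe with the two client calls (sketch only, exception handling omitted; client is the RestClusterClient the operator builds from the effective configuration):

// hangs / times out in my setup, the job stays in RUNNING:
client.stopWithSavepoint(jobId, /* advanceToEndOfEventTime */ false, targetDirectory)
      .get(2, TimeUnit.MINUTES);

// terminates the job and returns a savepoint path:
String savepointPath = client.cancelWithSavepoint(jobId, targetDirectory)
      .get(2, TimeUnit.MINUTES);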

Maybe stopWithSavepoint() should only be used for updating configurations like parallelism? For updating the image it seems unsuitable; please correct me if I am wrong.

For the log issue, I am still a bit confused: why are the logs not available via kubectl logs, and how should I get access to them?

Thanks.
Best,
Fuyao

From: Fuyao Li <fu...@oracle.com>
Date: Sunday, May 2, 2021 at 00:36
To: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello,

I noticed that first trigger a savepoint and then delete the deployment might cause the duplicate data issue. That could pose a bad influence to the semantic correctness. Please give me some hints on how to make the stopWithSavepoint() work correctly with Fabric8io Java k8s client to perform this image update operation. Thanks!

Best,
Fuyao



From: Fuyao Li <fu...@oracle.com>
Date: Friday, April 30, 2021 at 18:03
To: user <us...@flink.apache.org>, Yang Wang <da...@gmail.com>
Subject: [External] : Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator
Hello Community, Yang,

I have one more question for logging. I also noticed that if I execute kubectl logs  command to the JM. The pods provisioned by the operator can’t print out the internal Flink logs in the kubectl logs. I can only get something like the logs below. No actual flink logs is printed here… Where can I find the path to the logs? Maybe use a sidecar container to get it out? How can I get the logs without checking the Flink WebUI? Also, the sed error makes me confused here. In fact, the application is already up and running correctly if I access the WebUI through Ingress.

Reference: https://github.com/wangyang0918/flink-native-k8s-operator/issues/4<https://urldefense.com/v3/__https:/github.com/wangyang0918/flink-native-k8s-operator/issues/4__;!!GqivPVa7Brio!PZPkOj4s7du8ItEG-AxKGR2EN6pWDuKfwcjZNKbpLfhXHRD3IoaH6zptEJWo5vM$>


[root@bastion deploy]# kubectl logs -f flink-demo-594946fd7b-822xk

sed: couldn't open temporary file /opt/flink/conf/sedh1M3oO: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sed8TqlNR: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedvO2DFU: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html<https://urldefense.com/v3/__https:/logging.apache.org/log4j/2.x/manual/configuration.html__;!!GqivPVa7Brio!PZPkOj4s7du8ItEG-AxKGR2EN6pWDuKfwcjZNKbpLfhXHRD3IoaH6zptpRoiZsE$> for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release


-------- The logs stops here, flink applications logs doesn’t get printed here anymore---------

^C
[root@bastion deploy]# kubectl logs -f flink-demo-taskmanager-1-1
sed: couldn't open temporary file /opt/flink/conf/sedaNDoNR: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/seddze7tQ: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedYveZoT: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx697932173 -Xms697932173 -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=166429984b -D taskmanager.memory.network.min=166429984b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=665719939b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=563714445b -D taskmanager.memory.task.off-heap.size=0b --configDir /opt/flink/conf -Djobmanager.memory.jvm-overhead.min='429496736b' -Dpipeline.classpaths='file:usrlib/quickstart-0.1.jar' -Dtaskmanager.resource-id='flink-demo-taskmanager-1-1' -Djobmanager.memory.off-heap.size='134217728b' -Dexecution.target='embedded' -Dweb.tmpdir='/tmp/flink-web-d7691661-fac5-494e-8154-896b4fe30692' -Dpipeline.jars='file:/opt/flink/usrlib/quickstart-0.1.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='3462817376b' -Djobmanager.memory.jvm-overhead.max='429496736b'
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Apr 29, 2021 12:58:34 AM oracle.simplefan.impl.FanManager configure
SEVERE: attempt to configure ONS in FanManager failed with oracle.ons.NoServersAvailable: Subscription time out


-------- The logs stop here; Flink application logs don't get printed anymore ---------


Best,
Fuyao



Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello,

I noticed that triggering a savepoint first and then deleting the deployment might cause duplicate data, which would hurt the semantic correctness of the job. Please give me some hints on how to make stopWithSavepoint() work correctly with the Fabric8io Java Kubernetes client to perform this image update operation. Thanks!
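
For reference, roughly what I am trying to call looks like the sketch below (this is not the operator's actual code; it assumes a ClusterClient<String> for the application cluster is already available from the operator's existing savepoint path, and the savepoint directory and timeout values are just placeholders). Bounding the future with an explicit timeout at least turns a hung REST call into a TimeoutException instead of blocking the reconcile loop forever:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.ClusterClient;

// Minimal sketch, not the operator's real code. The ClusterClient<String> is
// assumed to come from the operator's existing code path; the savepoint
// directory and timeout below are placeholder values.
public final class StopWithSavepointSketch {

    private StopWithSavepointSketch() {}

    public static String stopJobWithSavepoint(ClusterClient<String> client, JobID jobId) throws Exception {
        // Drain the job and write a final savepoint in one step, so the new image
        // can resume from that savepoint without replaying in-flight records.
        return client
                .stopWithSavepoint(jobId, false, "s3://my-bucket/savepoints")
                .get(2, TimeUnit.MINUTES); // bound the wait instead of blocking forever
    }
}

The reason I prefer this over the solution 2 workaround is that stop-with-savepoint drains the pipeline before taking the savepoint, so nothing processed after the savepoint gets reprocessed by the new deployment.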

Best,
Fuyao




Re: StopWithSavepoint() method doesn't work in Java based flink native k8s operator

Posted by Fuyao Li <fu...@oracle.com>.
Hello Community, Yang,

I have one more question about logging. I also noticed that when I run the kubectl logs command against the JM, the pods provisioned by the operator don't print the internal Flink logs; I only get something like the output below, and no actual Flink logs are printed here… Where can I find the path to the logs? Should I use a sidecar container to ship them out? How can I read the logs without checking the Flink WebUI? Also, the sed errors confuse me; in fact, the application is already up and running correctly when I access the WebUI through the Ingress.

Reference: https://github.com/wangyang0918/flink-native-k8s-operator/issues/4
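
Judging from the "No Log4j 2 configuration file found" error in the output below, the JVM inside the container never picks up a Log4j 2 configuration, so only errors reach the console. For my own reference, a minimal Log4j 2 console configuration (under whatever file name the image's entrypoint expects in /opt/flink/conf) would look roughly like the snippet below; this is just an illustrative sketch, not necessarily the exact file shipped with the Flink distribution:

# Illustrative Log4j 2 properties; values are examples only.
rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender

# Send everything to stdout so the container runtime (and kubectl logs) captures it.
appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n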


[root@bastion deploy]# kubectl logs -f flink-demo-594946fd7b-822xk

sed: couldn't open temporary file /opt/flink/conf/sedh1M3oO: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sed8TqlNR: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedvO2DFU: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx3462817376 -Xms3462817376 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint -D jobmanager.memory.off-heap.size=134217728b -D jobmanager.memory.jvm-overhead.min=429496736b -D jobmanager.memory.jvm-metaspace.size=268435456b -D jobmanager.memory.heap.size=3462817376b -D jobmanager.memory.jvm-overhead.max=429496736b
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to field java.util.Properties.serialVersionUID
WARNING: Please consider reporting this to the maintainers of org.apache.flink.api.java.ClosureCleaner
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release


-------- The logs stop here; Flink application logs don't get printed anymore ---------

^C
[root@bastion deploy]# kubectl logs -f flink-demo-taskmanager-1-1
sed: couldn't open temporary file /opt/flink/conf/sedaNDoNR: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/seddze7tQ: Read-only file system
/docker-entrypoint.sh: line 75: /opt/flink/conf/flink-conf.yaml: Read-only file system
sed: couldn't open temporary file /opt/flink/conf/sedYveZoT: Read-only file system
/docker-entrypoint.sh: line 88: /opt/flink/conf/flink-conf.yaml: Read-only file system
/docker-entrypoint.sh: line 90: /opt/flink/conf/flink-conf.yaml.tmp: Read-only file system
Start command: $JAVA_HOME/bin/java -classpath $FLINK_CLASSPATH -Xmx697932173 -Xms697932173 -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456 org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=166429984b -D taskmanager.memory.network.min=166429984b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=665719939b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=563714445b -D taskmanager.memory.task.off-heap.size=0b --configDir /opt/flink/conf -Djobmanager.memory.jvm-overhead.min='429496736b' -Dpipeline.classpaths='file:usrlib/quickstart-0.1.jar' -Dtaskmanager.resource-id='flink-demo-taskmanager-1-1' -Djobmanager.memory.off-heap.size='134217728b' -Dexecution.target='embedded' -Dweb.tmpdir='/tmp/flink-web-d7691661-fac5-494e-8154-896b4fe30692' -Dpipeline.jars='file:/opt/flink/usrlib/quickstart-0.1.jar' -Djobmanager.memory.jvm-metaspace.size='268435456b' -Djobmanager.memory.heap.size='3462817376b' -Djobmanager.memory.jvm-overhead.max='429496736b'
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.12.1.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Apr 29, 2021 12:58:34 AM oracle.simplefan.impl.FanManager configure
SEVERE: attempt to configure ONS in FanManager failed with oracle.ons.NoServersAvailable: Subscription time out


-------- The logs stop here; Flink application logs don't get printed anymore ---------


Best,
Fuyao

