Posted to user@flink.apache.org by yanjie <gy...@qq.com> on 2021/08/25 02:28:25 UTC

Re: AdaptiveScheduler stopped without exception

@Till Rohrmann, thanks for your clear explanation.




------------------ Original Message ------------------
From: "Till Rohrmann" <trohrmann@apache.org>
Date: Tuesday, August 24, 2021, 8:51 PM
To: "yanjie" <gyj199482@qq.com>
Cc: "user" <user@flink.apache.org>
Subject: Re: AdaptiveScheduler stopped without exception



Hi Yanjie,

The observed exception in the logs is just a side effect of the shutdown procedure. It is a bug that shutting down the Dispatcher results in a fatal exception coming from the ApplicationDispatcherBootstrap. I've created a ticket to fix it [1].


The true reason for stopping the SessionDispatcherLeaderProcess is that the DefaultDispatcherRunner lost its leadership. Unfortunately, we don't log this event at INFO level; if you enable the debug log level, you should see it. When the Dispatcher loses leadership, the Dispatcher component is stopped. I will improve the logging of the DefaultDispatcherRunner to better state when it gains and loses leadership [2]. I hope this will make the logs easier to understand.
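
For reference, enabling that debug logging in Flink 1.13's conf/log4j.properties might look roughly like the sketch below (Flink 1.13 ships a log4j2-style properties file; the logger name assumes DefaultDispatcherRunner lives in org.apache.flink.runtime.dispatcher.runner):

    # Sketch: surface dispatcher-runner leadership grant/revoke events
    # that are otherwise only visible at DEBUG level.
    logger.dispatcherrunner.name = org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner
    logger.dispatcherrunner.level = DEBUG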


In the second job manager log, it is effectively the same, just with the difference that the ResourceManager loses its leadership first. It seems as if the cause for the leadership loss could be that 172.18.0.1:443 is no longer reachable (probably the K8s API server).


[1] https://issues.apache.org/jira/browse/FLINK-23946
[2] https://issues.apache.org/jira/browse/FLINK-23947


Cheers,
Till


On Tue, Aug 24, 2021 at 9:56 AM yanjie <gyj199482@qq.com> wrote:

Hi all,


I run an Application Cluster on Azure K8s. The job works fine for a while, then the jobmanager catches an exception:


org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
	at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:415) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
	at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
	at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
	at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398) ~[flink-dist_2.11-1.13.0.jar:1.13.0]
	... (remainder omitted)

There is no other exception before it. The jobmanager then executes its stopping steps and shuts down.
Because nothing else is logged beforehand, I don't know why 'AdaptiveScheduler is being stopped'.


My questions:
What causes this issue (flink-jobmanager-1593852-jgwjt.log)?
Is this exception caused by a network issue (as encountered in flink-jobmanager-1593852-kr22z.log)?
Why doesn't the first jobmanager (flink-jobmanager-1593852-jgwjt) throw any exception beforehand?


Logs:
The attached log files contain the jobmanager's & taskmanager's logs. I configured k8s HA with the jobmanager's parallelism=1 (the issue recurs whether the jobmanager's parallelism is set to 1 or 2).
flink-jobmanager-1593852-jgwjt.log:
Works fine until '2021-08-23 05:08:25'.


flink-jobmanager-1593852-kr22z.log:
Starts at '2021-08-23 05:08:35' and restores my job, which works fine for a while. Then, at '2021-08-23 14:24:15', the jobmanager appears to hit a network issue (possibly an Azure K8s network issue that prevents Flink from operating on the ConfigMap, so it loses leadership after the k8s-ha lease duration). At '2021-08-23 14:24:32', this jobmanager catches the 'AdaptiveScheduler is being stopped' exception again and then shuts down.


flink--taskexecutor-0-flink-taskmanager-1593852-56dfcd95bc-hvnps.log:
Contains the taskmanager's logs from the beginning to '2021-08-23 09:15:24', covering the first jobmanager's (flink-jobmanager-1593852-jgwjt) lifecycle.




Background:
Deployment & Configuration
I followed this page to deploy an Application Cluster that runs my job: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
I also added the configurations for high availability on Kubernetes and enabled the reactive scheduler mode.
The attached yaml files contain the 'flink-config', 'flink-jobmanager', and 'flink-taskmanager' configurations.
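
For context, the relevant flink-conf.yaml entries for Kubernetes HA plus reactive mode in Flink 1.13 would look roughly like this sketch (the cluster-id and storage path are placeholders, not the values from my attached files):

    # Kubernetes HA (Flink 1.13): leader election and job metadata via ConfigMaps.
    kubernetes.cluster-id: my-flink-app
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: wasbs://<container>@<account>.blob.core.windows.net/flink-ha
    # Reactive mode implies the adaptive scheduler; requires an application cluster.
    scheduler-mode: reactive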


Other experiences:
In a previous test, when deploying my Flink job on the Azure K8s cluster, I encountered a 'network issue' once. It prevented the leader jobmanager from renewing the ConfigMap for a while, so the standby jobmanager was elected leader; when the previous leader's network recovered, it saw that it was no longer the leader and shut down. Because of Kubernetes' default 'backoffLimit=6' configuration, my Flink job was eventually removed.
I'm fixing this issue by increasing the Kubernetes HA timeouts, as this official document describes: https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#advanced-high-availability-kubernetes-options
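
For illustration, the lease-related options from that page look roughly like this in flink-conf.yaml (the values below are examples I'd tune for my environment, not recommendations):

    # Sketch: give the leader more headroom to renew its ConfigMap
    # annotation during transient API-server outages.
    high-availability.kubernetes.leader-election.lease-duration: 60 s
    high-availability.kubernetes.leader-election.renew-deadline: 60 s
    high-availability.kubernetes.leader-election.retry-period: 5 s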




My analysis:
Both jobmanagers' log files contain the same exception: 'AdaptiveScheduler is being stopped'. The first jobmanager doesn't print any exception before it.
The second jobmanager prints a network exception, which suggests this failure is caused by a network issue.

And I did encounter a 'network issue' in the previous test, and the fix for it is still in progress, so this exception may also be caused by a 'network issue'.


The reason I raised this question is that the first jobmanager doesn't print any information beforehand, and I wonder why that happens.