Posted to issues@flink.apache.org by "wgcn (JIRA)" <ji...@apache.org> on 2018/10/30 09:52:00 UTC

[jira] [Commented] (FLINK-5770) Flink yarn session stop in non-detached model

    [ https://issues.apache.org/jira/browse/FLINK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668448#comment-16668448 ] 

wgcn commented on FLINK-5770:
-----------------------------

Your client may have shut down. You can use the -d argument to start the session in detached mode.
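
A minimal sketch of that suggestion, assuming the same bin/ directory and TaskManager count as the command quoted in the report below. APP_ID is a hypothetical placeholder for the application ID that YARN assigns; it is not taken from the report.

```shell
# Same command as in the report, plus -d: the session starts in detached mode,
# so the client exits after submission and a later client shutdown cannot
# take the session down with it.
./yarn-session.sh -n 2 -d

# A detached session is no longer controlled from the client, so stop it through YARN.
# APP_ID is a placeholder (assumption); substitute the ID printed at startup.
yarn application -kill "$APP_ID"
```

This only works around the client-side shutdown. It does not explain why the client lost its connection to the JobManager after about 4 hours.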

> Flink yarn session stop in non-detached model
> ---------------------------------------------
>
>                 Key: FLINK-5770
>                 URL: https://issues.apache.org/jira/browse/FLINK-5770
>             Project: Flink
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.2.0
>         Environment: 1. The cluster contains 4 nodes;
> 2. Every node has 380 GB of memory and a 40-core CPU;
> 3. The OS is CentOS 7.2.
>            Reporter: zhangrucong1982
>            Priority: Major
>
> 1. I use a recent version of Flink, and run Flink in security mode without HA. The configurations in flink-conf.yaml are:
> security.kerberos.login.keytab: /home/demo/flink/release/flink-1.2.2/keytab/huawei1.keytab
> security.kerberos.login.principal: huawei1
> security.kerberos.login.contexts: Client,KafkaClient
> 2. Then I use the command ./yarn-session.sh -n 2 to start the cluster with two TaskManagers.
> 3. But about 4 hours later, the session shuts down by itself. The error stack is as follows:
> 2017-02-07 19:27:30,841 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@9-96-101-251:38650] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2017-02-07 19:27:42,804 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Exception while running the interactive command line interface
> java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
>         at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
>         at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:410)
>         at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:663)
>         at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:476)
>         at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:473)
>         at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>         at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:473)
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:142)
>         at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:691)
>         at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
>         ... 10 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at scala.concurrent.Await.result(package.scala)
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:140)
>         ... 12 more
> 4. The detailed log is available at the following link:
> https://docs.google.com/document/d/1mbxrCy6mHHFxcxPv8f7CCA3BI1QVGPeNiHxUQhuZP0o/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)