Posted to user@flink.apache.org by Sridhar Chellappa <fl...@gmail.com> on 2017/09/11 09:17:38 UTC

Cannot deploy Flink on YARN

I am trying to start Flink (version 1.3.0) on YARN (Hadoop 2.8.1) by issuing
the following command:

~/flink-1.3.0/bin/yarn-session.sh -s 4 -n 10 -jm 4096 -tm 4096 -d

I am seeing a flurry of these messages:

2017-09-11 08:17:11,410 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
2017-09-11 08:17:11,661 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
2017-09-11 08:17:11,912 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
2017-09-11 08:17:12,163 INFO  org.apache.flink.yarn.YarnClusterDescriptor                   - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster


And then my deployment fails with the following exception:

Error while deploying YARN cluster: Couldn't deploy Yarn cluster
java.lang.RuntimeException: Couldn't deploy Yarn cluster
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:439)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:630)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:486)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:483)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:483)
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1504851547322_0003 failed 2 times due to AM Container for appattempt_1504851547322_0003_000002 exited with  exitCode: 31
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1504851547322_0003_02_000001
Exit code: 31
Stack trace: ExitCodeException exitCode=31:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)



Further debugging in the JobManager logs shows:

Resetting connection and trying again with a new connection.
2017-09-11 08:17:11,820 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=high-availability.zookeeper.quorum: 10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@57bd802b
2017-09-11 08:17:11,927 ERROR org.apache.flink.yarn.YarnApplicationMasterRunner             - YARN Application Master initialization failed
java.net.UnknownHostException: high-availability.zookeeper.quorum: 10.200.0.6: Name or service not known
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)


Any help in figuring this out would be appreciated.
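
One detail in the JobManager log above may be worth a closer look: the logged connectString begins with the text "high-availability.zookeeper.quorum: ", and the UnknownHostException names "high-availability.zookeeper.quorum: 10.200.0.6" as the host it could not resolve. The Python sketch below splits the logged string into host/port pairs to show where such a host name could come from. It is a simplified illustration only: the function name split_connect_string is hypothetical, and this is not ZooKeeper's actual parser.

```python
# Simplified sketch: split a ZooKeeper-style connect string into (host, port)
# pairs. split_connect_string is a hypothetical helper for illustration and
# is NOT ZooKeeper's real parsing code.

def split_connect_string(connect_string, default_port=2181):
    servers = []
    for entry in connect_string.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Treat everything before the last ':' as the host name.
        host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            servers.append((host, int(port)))
        else:
            servers.append((entry, default_port))
    return servers


# The value exactly as it appears after "connectString=" in the log.
logged = ("high-availability.zookeeper.quorum: "
          "10.200.0.6:2181,10.200.0.7:2181,10.200.0.9:2181")

for host, port in split_connect_string(logged):
    print(repr(host), port)
```

The first pair comes out with the host "high-availability.zookeeper.quorum: 10.200.0.6", which matches the name in the exception, while the other two entries are plain IP addresses. If that reading is right, the configuration key name may have ended up inside the configured value, so the quorum setting in the Flink configuration would be worth checking. This is an assumption; nothing in the thread confirms it.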

Re: Cannot deploy Flink on YARN

Posted by Aljoscha Krettek <al...@apache.org>.
Since you're running in a container, the question is whether the container where the JM is running can access the ZooKeeper at 10.200.0.6.
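
A minimal sketch of how that could be checked from the machine (or container) that runs the JobManager, assuming Python is available there. The helper names can_connect and check_quorum are hypothetical; the addresses and port 2181 are taken from the connect string in the log. Note that a successful ssh login only shows that the ssh port is reachable, which says nothing about the ZooKeeper client port.

```python
import socket

# Hosts and port taken from the connectString in the JobManager log.
QUORUM = [("10.200.0.6", 2181), ("10.200.0.7", 2181), ("10.200.0.9", 2181)]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_quorum(servers, timeout=5.0):
    """Map each "host:port" string to whether a TCP connection succeeded."""
    return {
        "{}:{}".format(host, port): can_connect(host, port, timeout)
        for host, port in servers
    }
```

Calling check_quorum(QUORUM) on the node where the YARN container was launched returns a dictionary with one True/False entry per ZooKeeper server. It only tests TCP reachability, not whether ZooKeeper itself is healthy.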

> On 27. Sep 2017, at 04:31, Sridhar Chellappa <fl...@gmail.com> wrote:
> 
> Emily,
> 
> I did not get a chance to capture the logs on the container. Since I have erased the instances, I have lost access to the logs. I have moved to non-HA mode (single master) and it is running OK.
> 
> Aljoscha,
> 
> Network connectivity is good. I am able to ssh to 10.200.0.6.
> 
> I will try HA mode again, capture all the logs, and send them over.
> 
> On Tue, Sep 26, 2017 at 6:37 PM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> Is the IP 10.200.0.6 reachable from the machine that runs the JobManager?
> 
>> On 25. Sep 2017, at 19:58, Emily McMahon <emilymc@remitly.com> wrote:
>> 
>> What's in the container log for the container that failed? 


Re: Cannot deploy Flink on YARN

Posted by Sridhar Chellappa <fl...@gmail.com>.
Emily,

I did not get a chance to capture the logs on the container. Since I have
erased the instances, I have lost access to the logs. I have moved to non-HA
mode (single master) and it is running OK.

Aljoscha,

Network connectivity is good. I am able to ssh to 10.200.0.6.

I will try HA mode again, capture all the logs, and send them over.

On Tue, Sep 26, 2017 at 6:37 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Is the IP 10.200.0.6 reachable from the machine that runs the JobManager?
>
> On 25. Sep 2017, at 19:58, Emily McMahon <em...@remitly.com> wrote:
>
> What's in the container log for the container that failed?

Re: Cannot deploy Flink on YARN

Posted by Aljoscha Krettek <al...@apache.org>.
Is the IP 10.200.0.6 reachable from the machine that runs the JobManager?

> On 25. Sep 2017, at 19:58, Emily McMahon <em...@remitly.com> wrote:
> 
> What's in the container log for the container that failed? 


Re: Cannot deploy Flink on YARN

Posted by Emily McMahon <em...@remitly.com>.
What's in the container log for the container that failed?
