You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/10/19 15:03:00 UTC

[jira] [Commented] (FLINK-10368) 'Kerberized YARN on Docker test' instable

    [ https://issues.apache.org/jira/browse/FLINK-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656901#comment-16656901 ] 

Till Rohrmann commented on FLINK-10368:
---------------------------------------

I encountered another problem:
{code}The program finished with the following exception:

org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
        at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:419)
        at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:261)
        at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The number of requested virtual cores per node 1 exceeds the maximum number of virtual cores 0 available in the Yarn Cluster. Please note that the number of virtual cores is set to the number of task slots by default unless configured in the Flink config with 'yarn.containers.vcores.'
        at org.apache.flink.yarn.AbstractYarnClusterDescriptor.isReadyForDeployment(AbstractYarnClusterDescriptor.java:299)
        at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:491)
        at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:412)
        ... 9 more
{code}
Apparently, the cluster thinks that it has only {{0}} vcores. This might be a setup issue.

> 'Kerberized YARN on Docker test' instable
> -----------------------------------------
>
>                 Key: FLINK-10368
>                 URL: https://issues.apache.org/jira/browse/FLINK-10368
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.7.0
>
>
> Running Kerberized YARN on Docker test end-to-end test failed on an AWS instance. The problem seems to be that the NameNode went into safe-mode due to limited resources.
> {code}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hadoop-user/flink-1.6.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2018-09-19 09:04:39,201 INFO  org.apache.hadoop.security.UserGroupInformation               - Login successful for user hadoop-user using keytab file /home/hadoop-user/hadoop-user.keytab
> 2018-09-19 09:04:39,453 INFO  org.apache.hadoop.yarn.client.RMProxy                         - Connecting to ResourceManager at master.docker-hadoop-cluster-network/172.22.0.3:8032
> 2018-09-19 09:04:39,640 INFO  org.apache.hadoop.yarn.client.AHSProxy                        - Connecting to Application History server at master.docker-hadoop-cluster-network/172.22.0.3:10200
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,901 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Cluster specification: ClusterSpecification{masterMemoryMB=2000, taskManagerMemoryMB=2000, numberTaskManagers=3, slotsPerTaskManager=1}
> 2018-09-19 09:04:40,286 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor           - The configuration directory ('/home/hadoop-user/flink-1.6.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:420)
>         at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:259)
>         at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>         at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>         at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:master.docker-hadoop-cluster-network
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:270)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1274)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1216)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:470)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:470)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:411)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
>         at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2002)
>         at org.apache.flink.yarn.Utils.setupLocalResource(Utils.java:162)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setupSingleLocalResource(AbstractYarnClusterDescriptor.java:1139)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.access$000(AbstractYarnClusterDescriptor.java:111)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1200)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1188)
>         at java.nio.file.Files.walkFileTree(Files.java:2670)
>         at java.nio.file.Files.walkFileTree(Files.java:2742)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.uploadAndRegisterFiles(AbstractYarnClusterDescriptor.java:1188)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:800)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:542)
>         at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:413)
>         ... 9 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:master.docker-hadoop-cluster-network
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1435)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1345)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>         at com.sun.proxy.$Proxy14.create(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
>         at com.sun.proxy.$Proxy15.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:265)
>         ... 33 more
> Running the Flink job failed, might be that the cluster is not ready yet. We have been trying for 795 seconds, retrying ...
> {code}
> I think it would be good to harden the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)