Posted to yarn-issues@hadoop.apache.org by "kyungwan nam (JIRA)" <ji...@apache.org> on 2019/05/15 06:47:00 UTC

[jira] [Commented] (YARN-9521) RM failed to start due to system services

    [ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840096#comment-16840096 ] 

kyungwan nam commented on YARN-9521:
------------------------------------

I think the cause of this problem is as follows.

1. _fs_ is set by calling FileSystem.get() in SystemServiceManagerImpl.serviceInit.

2. RMAppImpl.appAdminClientCleanUp is called from RMAppImpl.FinalTransition if an APP_COMPLETED event occurs during RMStateStore recovery.

{code}
  static void appAdminClientCleanUp(RMAppImpl app) {
    try {
      AppAdminClient client = AppAdminClient.createAppAdminClient(app
          .applicationType, app.conf);
      int result = client.actionCleanUp(app.name, app.user);
{code}

ApiServiceClient.actionCleanUp
{code}
  @Override
  public int actionCleanUp(String appName, String userName) throws
      IOException, YarnException {
    ServiceClient sc = new ServiceClient();
    sc.init(getConfig());
    sc.start();
    int result = sc.actionCleanUp(appName, userName);
    sc.close();
    return result;
  }
{code}

The ServiceClient instance obtains a FileSystem by calling FileSystem.get() at initialization time, but this may return a cached instance, i.e. the same object that _fs_ refers to. That cached FileSystem is then closed by _sc.close()_.

3. scanForUserServices is called in SystemServiceManagerImpl.serviceStart, but _fs_ has already been closed.
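The interaction above can be sketched with a toy cache that mimics FileSystem.get()'s caching behavior. ToyFileSystem and CacheHazardDemo are hypothetical names for illustration only; the real Hadoop cache is keyed by scheme, authority, and UGI, which is why both callers here end up sharing one instance.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of Hadoop's FileSystem cache: get() returns a shared instance
// per URI, so close() through one holder invalidates it for every holder.
class ToyFileSystem implements Closeable {
  private static final Map<String, ToyFileSystem> CACHE = new HashMap<>();
  private boolean closed = false;

  static synchronized ToyFileSystem get(String uri) {
    return CACHE.computeIfAbsent(uri, u -> new ToyFileSystem());
  }

  String[] listStatus(String path) throws IOException {
    if (closed) {
      // same error that DFSClient.checkOpen raises
      throw new IOException("Filesystem closed");
    }
    return new String[] { path + "/example.yarnfile" };
  }

  @Override
  public void close() {
    closed = true; // the cached instance is now unusable for all holders
  }
}

public class CacheHazardDemo {
  public static void main(String[] args) {
    // 1. SystemServiceManagerImpl.serviceInit: fs = FileSystem.get(conf)
    ToyFileSystem fs = ToyFileSystem.get("hdfs://test");

    // 2. ApiServiceClient.actionCleanUp: a new ServiceClient gets the
    //    SAME cached instance, and sc.close() closes it.
    ToyFileSystem scFs = ToyFileSystem.get("hdfs://test");
    scFs.close();

    // 3. SystemServiceManagerImpl.serviceStart: scanForUserServices fails.
    try {
      fs.listStatus("/services");
      System.out.println("list succeeded");
    } catch (IOException e) {
      System.out.println(e.getMessage()); // prints "Filesystem closed"
    }
  }
}
```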



RM log

{code}

// 1. SystemServiceManagerImpl.serviceInit
//
2019-05-15 10:27:59,445 DEBUG service.AbstractService (AbstractService.java:enterState(443)) - Service: org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl entered state INITED
2019-05-15 10:27:59,446 INFO  client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services
2019-05-15 10:27:59,472 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3209)) - Loading filesystems
2019-05-15 10:27:59,483 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - file:// = class org.apache.hadoop.fs.LocalFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,488 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - viewfs:// = class org.apache.hadoop.fs.viewfs.ViewFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,491 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - har:// = class org.apache.hadoop.fs.HarFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,492 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - http:// = class org.apache.hadoop.fs.http.HttpFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,493 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - https:// = class org.apache.hadoop.fs.http.HttpsFileSystem from /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,503 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - hdfs:// = class org.apache.hadoop.hdfs.DistributedFileSystem from /usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,511 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - webhdfs:// = class org.apache.hadoop.hdfs.web.WebHdfsFileSystem from /usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,512 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - swebhdfs:// = class org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from /usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:loadFileSystems(3221)) - s3n:// = class org.apache.hadoop.fs.s3native.NativeS3FileSystem from /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-aws-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs
2019-05-15 10:27:59,514 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3268)) - looking for configuration option fs.hdfs.impl
2019-05-15 10:27:59,528 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for implementation class
2019-05-15 10:27:59,528 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class org.apache.hadoop.hdfs.DistributedFileSystem

// 2. APP_COMPLETED event occurs
//
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:handle(895)) - Processing event for application_1556612756829_0001 of type RECOVER
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:recover(933)) - Recovering app: application_1556612756829_0001 with 2 attempts and final state = FAILED
2019-05-15 10:28:02,931 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:<init>(544)) - yarn.app.attempt.diagnostics.limit.kc : 64
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(969)) - Recovering attempt: appattempt_1556612756829_0001_000001 with final state = FAILED
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:<init>(544)) - yarn.app.attempt.diagnostics.limit.kc : 64
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(969)) - Recovering attempt: appattempt_1556612756829_0001_000002 with final state = FAILED
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(910)) - Processing event for appattempt_1556612756829_0001_000001 of type RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(930)) - appattempt_1556612756829_0001_000001 State change from NEW to FAILED on event = RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(910)) - Processing event for appattempt_1556612756829_0001_000002 of type RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(930)) - appattempt_1556612756829_0001_000002 State change from NEW to FAILED on event = RECOVER
2019-05-15 10:28:02,933 DEBUG event.AsyncDispatcher (AsyncDispatcher.java:dispatch(188)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEvent.EventType: APP_COMPLETED
2019-05-15 10:28:02,933 DEBUG resourcemanager.RMAppManager (RMAppManager.java:handle(573)) - RMAppManager processing event for application_1556612756829_0001 of type APP_COMPLETED
2019-05-15 10:28:02,934 DEBUG service.AbstractService (AbstractService.java:enterState(443)) - Service: org.apache.hadoop.yarn.client.api.AppAdminClient entered state INITED
2019-05-15 10:28:02,934 DEBUG service.CompositeService (CompositeService.java:addService(74)) - Adding service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl
2019-05-15 10:28:02,934 DEBUG service.CompositeService (CompositeService.java:serviceInit(105)) - org.apache.hadoop.yarn.client.api.AppAdminClient: initing services, size=1
2019-05-15 10:28:02,934 DEBUG service.AbstractService (AbstractService.java:enterState(443)) - Service: org.apache.hadoop.yarn.client.api.impl.YarnClientImpl entered state INITED
2019-05-15 10:28:02,934 DEBUG service.CompositeService (CompositeService.java:serviceStart(116)) - org.apache.hadoop.yarn.client.api.AppAdminClient: starting services, size=1
2019-05-15 10:28:02,935 INFO  client.RMProxy (RMProxy.java:newProxyInstance(133)) - Connecting to ResourceManager at test.io/10.1.1.1:8050
2019-05-15 10:28:02,935 DEBUG security.UserGroupInformation (UserGroupInformation.java:logPrivilegedAction(1756)) - PrivilegedAction as:yarn (auth:SIMPLE) from:org.apache.hadoop.yarn.client.RMProxy.getProxy(RMProxy.java:147)
2019-05-15 10:28:02,935 DEBUG ipc.YarnRPC (YarnRPC.java:create(59)) - Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
2019-05-15 10:28:02,935 DEBUG ipc.HadoopYarnProtoRPC (HadoopYarnProtoRPC.java:getProxy(47)) - Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.yarn.api.ApplicationClientProtocol
2019-05-15 10:28:02,936 DEBUG ipc.Client (ClientCache.java:getClient(63)) - getting client out of cache: org.apache.hadoop.ipc.Client@560cbf1a
2019-05-15 10:28:02,936 DEBUG service.AbstractService (AbstractService.java:start(197)) - Service org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started
2019-05-15 10:28:02,936 DEBUG service.AbstractService (AbstractService.java:start(197)) - Service org.apache.hadoop.yarn.client.api.AppAdminClient is started
2019-05-15 10:28:02,936 DEBUG service.AbstractService (AbstractService.java:enterState(443)) - Service: org.apache.hadoop.yarn.client.api.AppAdminClient entered state INITED
2019-05-15 10:28:02,936 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs
2019-05-15 10:28:02,936 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3268)) - looking for configuration option fs.hdfs.impl
2019-05-15 10:28:02,937 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for implementation class
2019-05-15 10:28:02,937 DEBUG fs.FileSystem (FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class org.apache.hadoop.hdfs.DistributedFileSystem
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf (DfsClientConf.java:<init>(673)) - dfs.client.use.legacy.blockreader.local = false
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf (DfsClientConf.java:<init>(675)) - dfs.client.read.shortcircuit = true
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf (DfsClientConf.java:<init>(677)) - dfs.client.domain.socket.data.traffic = false
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf (DfsClientConf.java:<init>(679)) - dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket
2019-05-15 10:28:02,938 DEBUG hdfs.DFSClient (DFSClient.java:<init>(316)) - Sets dfs.client.block.write.replace-datanode-on-failure.min-replication to 0

// 3. SystemServiceManagerImpl.serviceStart
//
2019-05-15 10:28:03,424 INFO  client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:scanForUserServices(281)) - Scan for launch type on /services
2019-05-15 10:28:03,427 DEBUG service.AbstractService (AbstractService.java:noteFailure(257)) - noteFailure {}java.io.IOException: Filesystem closed
2019-05-15 10:28:03,427 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl failed in state STARTED
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)

{code}

> RM failed to start due to system services
> ----------------------------------------
>
>                 Key: YARN-9521
>                 URL: https://issues.apache.org/jira/browse/YARN-9521
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: kyungwan nam
>            Priority: Major
>         Attachments: YARN-9521.001.patch
>
>
> when starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: Filesystem closed
>         at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>         at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
>         at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
>         at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
>         at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
>         at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>         ... 13 more
> {code}
> it looks like this is caused by the FileSystem cache.
> this issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to yarn-site
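For reference, the workaround quoted above would be set in yarn-site.xml roughly as follows. This is a sketch: fs.hdfs.impl.disable.cache makes FileSystem.get() return a fresh, uncached instance for hdfs:// URIs, so ServiceClient's close() cannot close the instance held by SystemServiceManagerImpl, at the cost of extra client objects.

```xml
<!-- yarn-site.xml: disable the FileSystem cache for the hdfs scheme -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```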


