Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2018/03/08 15:46:00 UTC

[jira] [Commented] (YARN-8014) YARN ResourceManager Lists A NodeManager As RUNNING & SHUTDOWN Simultaneously

    [ https://issues.apache.org/jira/browse/YARN-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391417#comment-16391417 ] 

Jason Lowe commented on YARN-8014:
----------------------------------

I believe this is an artifact of the NM appearing to the RM as two separate NodeManager instances.  Note that the NM port changed between the two instances: it was originally rb0101.local:43892 but became rb0101.local:42627 after the restart.  That explains why the node shows up twice when listing all nodes: the RM did not recognize the newly joining NM at port 42627 as the same one that had been at port 43892.  The RM does not preclude multiple NMs running on the same node; indeed, that is how the mini clusters used for unit tests run multiple NMs on a single host.
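
To illustrate (a minimal sketch against the public NodeId API, using the host and ports from the logs below; this is my reading of the identity semantics, not code from the RM itself):

    import org.apache.hadoop.yarn.api.records.NodeId;

    public class NodeIdentityDemo {
        public static void main(String[] args) {
            // A node's identity includes both host and port.
            NodeId before = NodeId.newInstance("rb0101.local", 43892);
            NodeId after = NodeId.newInstance("rb0101.local", 42627);
            System.out.println(before.equals(after)); // false -> two distinct nodes
        }
    }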

It is surprising that the shut-down NM instance does not appear when explicitly asking for nodes in the SHUTDOWN state.  I suspect that somewhere in the RM's bookkeeping the port distinction is dropped, so the RUNNING instance ends up superseding the SHUTDOWN one for that query.
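
For example, if that bookkeeping were keyed by hostname alone (a purely hypothetical sketch, not the actual RM code), the re-registration would silently replace the SHUTDOWN entry:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HostKeyedBookkeeping {
        public static void main(String[] args) {
            // Hypothetical: a node table keyed by hostname drops the port
            // distinction, so the second put replaces the first entry.
            Map<String, String> stateByHost = new ConcurrentHashMap<>();
            stateByHost.put("rb0101.local", "SHUTDOWN"); // :43892 unregisters
            stateByHost.put("rb0101.local", "RUNNING");  // :42627 re-registers
            System.out.println(stateByHost); // {rb0101.local=RUNNING}
        }
    }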

The simplest workaround is to use a fixed port for the NM.  The RM will then recognize the newly joining node as the same node that left previously.  A fixed port has the added benefit of precluding an accidental double startup of an NM on a node, which does not go well unless the node is intentionally configured for that scenario: both NMs would think they control the node's resources and end up using far more of them than intended.
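
For example, in yarn-site.xml (port 45454 here is just an example; any stable, unused port works):

    <property>
      <name>yarn.nodemanager.address</name>
      <!-- The default uses port 0, i.e. an ephemeral port chosen anew at each
           startup.  Pinning a fixed port keeps the node id stable across restarts. -->
      <value>0.0.0.0:45454</value>
    </property>

With a fixed port, the restarted NM registers under the same node id, and a second NM accidentally started on the same host fails to bind the port instead of silently doubling the node's advertised resources.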


> YARN ResourceManager Lists A NodeManager As RUNNING & SHUTDOWN Simultaneously
> -----------------------------------------------------------------------------
>
>                 Key: YARN-8014
>                 URL: https://issues.apache.org/jira/browse/YARN-8014
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.2
>            Reporter: Evan Tepsic
>            Priority: Minor
>
> A graceful shutdown and then startup of a NodeManager process using YARN/HDFS v2.8.2 seems to successfully place the node back into the RUNNING state. However, the ResourceManager also appears to keep the node in the SHUTDOWN state.
>  
> *Steps To Reproduce:*
> 1. SSH to the host running the NodeManager.
>  2. Switch to the user the NodeManager is running as (hadoop).
>  3. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh stop nodemanager
>  4. Wait for the NodeManager process to terminate gracefully.
>  5. Confirm the node is in the SHUTDOWN state via http://rb01rm01.local:8088/cluster/nodes
>  6. Execute cmd: /opt/hadoop/sbin/yarn-daemon.sh start nodemanager
>  7. Confirm the node is in the RUNNING state via http://rb01rm01.local:8088/cluster/nodes
>  
> *Investigation:*
>  1. Review the contents of the ResourceManager and NodeManager log files:
> ResourceManager log file:
>  2018-03-08 08:15:44,085 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node with node id : rb0101.local:43892 has shutdown, hence unregistering the node.
>  2018-03-08 08:15:44,092 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node rb0101.local:43892 as it is now SHUTDOWN
>  2018-03-08 08:15:44,092 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: rb0101.local:43892 Node Transitioned from RUNNING to SHUTDOWN
>  2018-03-08 08:15:44,093 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node rb0101.local:43892 cluster capacity: <memory:110592, vCores:54>
>  2018-03-08 08:16:08,915 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node rb0101.local(cmPort: 42627 httpPort: 8042) registered with capability: <memory:12288, vCores:6>, assigned nodeId rb0101.local:42627
>  2018-03-08 08:16:08,916 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: rb0101.local:42627 Node Transitioned from NEW to RUNNING
>  2018-03-08 08:16:08,916 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node rb0101.local:42627 cluster capacity: <memory:122880, vCores:60>
>  2018-03-08 08:16:34,826 WARN org.apache.hadoop.ipc.Server: Large response size 2976014 for call Call#428958 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplications from 192.168.1.100:44034
>  
> NodeManager log file:
>  2018-03-08 08:00:14,500 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 10720046250, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
>  2018-03-08 08:10:14,498 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 10720046250, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
>  2018-03-08 08:15:44,048 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
>  2018-03-08 08:15:44,101 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Successfully Unregistered the Node rb0101.local:43892 with ResourceManager.
>  2018-03-08 08:15:44,114 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
>  2018-03-08 08:15:44,226 INFO org.apache.hadoop.ipc.Server: Stopping server on 43892
>  2018-03-08 08:15:44,232 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 43892
>  2018-03-08 08:15:44,237 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
>  2018-03-08 08:15:44,239 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
>  2018-03-08 08:15:44,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
>  2018-03-08 08:15:44,284 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
>  2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
>  2018-03-08 08:15:44,285 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
>  2018-03-08 08:15:44,287 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
>  2018-03-08 08:15:44,289 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
>  2018-03-08 08:15:44,294 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
>  2018-03-08 08:15:44,295 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
>  2018-03-08 08:15:44,296 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
>  2018-03-08 08:15:44,297 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
>  /************************************************************
>  SHUTDOWN_MSG: Shutting down NodeManager at rb0101.local/192.168.1.101
>  ************************************************************/
>  2018-03-08 08:16:01,905 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
>  /************************************************************
>  STARTUP_MSG: Starting NodeManager
>  STARTUP_MSG: user = hadoop
>  STARTUP_MSG: host = rb0101.local/192.168.1.101
>  STARTUP_MSG: args = []
>  STARTUP_MSG: version = 2.8.2
>  STARTUP_MSG: classpath = blahblahblah (truncated for size-purposes)
>  STARTUP_MSG: build = Unknown -r Unknown; compiled by 'root' on 2017-09-14T18:22Z
>  STARTUP_MSG: java = 1.8.0_144
>  ************************************************************/
>  2018-03-08 08:16:01,918 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal handlers for [TERM, HUP, INT]
>  2018-03-08 08:16:03,202 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Node Manager health check script is not available or doesn't have execute permission, so not starting the node health script runner.
>  2018-03-08 08:16:03,321 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
>  2018-03-08 08:16:03,322 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
>  2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
>  2018-03-08 08:16:03,323 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices
>  2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  2018-03-08 08:16:03,324 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
>  2018-03-08 08:16:03,347 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.ContainerManagerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  2018-03-08 08:16:03,348 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.NodeManagerEventType for class org.apache.hadoop.yarn.server.nodemanager.NodeManager
>  2018-03-08 08:16:03,402 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
>  2018-03-08 08:16:03,484 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
>  2018-03-08 08:16:03,484 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system started
>  2018-03-08 08:16:03,561 INFO org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@4b8729ff
>  2018-03-08 08:16:03,564 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
>  2018-03-08 08:16:03,565 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
>  2018-03-08 08:16:03,565 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: AMRMProxyService is disabled
>  2018-03-08 08:16:03,566 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: per directory file limit = 8192
>  2018-03-08 08:16:03,621 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: usercache path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569
>  2018-03-08 08:16:03,667 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user1
>  2018-03-08 08:16:03,667 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user2
>  2018-03-08 08:16:03,668 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user3
>  2018-03-08 08:16:03,681 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : file:/space/hadoop/tmp/nm-local-dir/usercache_DEL_1520518563569/user4
>  2018-03-08 08:16:03,739 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
>  2018-03-08 08:16:03,793 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
>  2018-03-08 08:16:03,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@1187c9e8
>  2018-03-08 08:16:03,826 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Using ResourceCalculatorProcessTree : null
>  2018-03-08 08:16:03,827 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Physical memory check enabled: true
>  2018-03-08 08:16:03,827 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Virtual memory check enabled: true
>  2018-03-08 08:16:03,832 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: ContainersMonitor enabled: true
>  2018-03-08 08:16:03,841 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager resources: memory set to 12288MB.
>  2018-03-08 08:16:03,841 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager resources: vcores set to 6.
>  2018-03-08 08:16:03,846 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager with : physical-memory=12288 virtual-memory=25805 virtual-cores=6
>  2018-03-08 08:16:03,850 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting JVM pause monitor
>  2018-03-08 08:16:03,908 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 2000 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
>  2018-03-08 08:16:03,932 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 42627
>  2018-03-08 08:16:04,153 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.api.ContainerManagementProtocolPB to the server
>  2018-03-08 08:16:04,153 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting.
>  2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
>  2018-03-08 08:16:04,154 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 42627: starting
>  2018-03-08 08:16:04,166 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : rb0101.local:42627
>  2018-03-08 08:16:04,183 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
>  2018-03-08 08:16:04,184 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8040
>  2018-03-08 08:16:04,191 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server
>  2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
>  2018-03-08 08:16:04,191 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8040: starting
>  2018-03-08 08:16:04,192 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer started on port 8040
>  2018-03-08 08:16:04,312 INFO org.apache.hadoop.mapred.IndexCache: IndexCache created with max memory = 10485760
>  2018-03-08 08:16:04,330 INFO org.apache.hadoop.mapred.ShuffleHandler: mapreduce_shuffle listening on port 13562
>  2018-03-08 08:16:04,337 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at rb0101.local/192.168.1.101:42627
>  2018-03-08 08:16:04,337 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to 0.0.0.0/0.0.0.0:0
>  2018-03-08 08:16:04,340 INFO org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer: Instantiating NMWebApp at 0.0.0.0:8042
>  2018-03-08 08:16:04,427 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
>  2018-03-08 08:16:04,436 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
>  2018-03-08 08:16:04,442 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.nodemanager is not defined
>  2018-03-08 08:16:04,450 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
>  2018-03-08 08:16:04,461 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context node
>  2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
>  2018-03-08 08:16:04,462 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
>  2018-03-08 08:16:04,462 INFO org.apache.hadoop.security.HttpCrossOriginFilterInitializer: CORS filter not enabled. Please set hadoop.http.cross-origin.enabled to 'true' to enable it
>  2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /node/*
>  2018-03-08 08:16:04,465 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
>  2018-03-08 08:16:04,843 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
>  2018-03-08 08:16:04,846 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 8042
>  2018-03-08 08:16:04,846 INFO org.mortbay.log: jetty-6.1.26
>  2018-03-08 08:16:04,877 INFO org.mortbay.log: Extract jar:file:/opt/hadoop-2.8.2/share/hadoop/yarn/hadoop-yarn-common-2.8.2.jar!/webapps/node to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
>  2018-03-08 08:16:08,355 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
>  2018-03-08 08:16:08,356 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app node started at 8042
>  2018-03-08 08:16:08,473 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node ID assigned is : rb0101.local:42627
>  2018-03-08 08:16:08,498 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at rb01rm01.local/192.168.1.100:8031
>  2018-03-08 08:16:08,613 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
>  2018-03-08 08:16:08,621 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
>  2018-03-08 08:16:08,934 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -2086472604
>  2018-03-08 08:16:08,938 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -426187560
>  2018-03-08 08:16:08,939 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as rb0101.local:42627 with total resource of <memory:12288, vCores:6>
>  2018-03-08 08:16:08,939 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
>  2018-03-08 08:26:04,174 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
>  2018-03-08 08:36:04,170 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
>  2018-03-08 08:46:04,170 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
> 2. Listing all of YARN's nodes confirms that rb0101.local returned to the RUNNING state; however, the listing shows the node in two states at once, RUNNING and SHUTDOWN:
> [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -all
>  18/03/08 09:20:33 INFO client.RMProxy: Connecting to ResourceManager at rb01rm01.local/192.168.1.100:8032
>  18/03/08 09:20:34 INFO client.AHSProxy: Connecting to Application History server at rb01rm01.local/192.168.1.100:10200
>  Total Nodes:11
>  Node-Id Node-State Node-Http-Address Number-of-Running-Containers
>  rb0106.local:44160 RUNNING rb0106.local:8042 0
>  rb0105.local:32832 RUNNING rb0105.local:8042 0
>  rb0101.local:42627 RUNNING rb0101.local:8042 0
>  rb0108.local:38209 RUNNING rb0108.local:8042 0
>  rb0107.local:34306 RUNNING rb0107.local:8042 0
>  rb0102.local:43063 RUNNING rb0102.local:8042 0
>  rb0103.local:42374 RUNNING rb0103.local:8042 0
>  rb0109.local:37455 RUNNING rb0109.local:8042 0
>  rb0110.local:36690 RUNNING rb0110.local:8042 0
>  rb0104.local:33268 RUNNING rb0104.local:8042 0
>  rb0101.local:43892 SHUTDOWN rb0101.local:8042 0
>  [hadoop@rb01rm01 logs]$
> [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states RUNNING
>  18/03/08 09:20:55 INFO client.RMProxy: Connecting to ResourceManager at rb01rm01.local/192.168.1.100:8032
>  18/03/08 09:20:56 INFO client.AHSProxy: Connecting to Application History server at rb01rm01.local/192.168.1.100:10200
>  Total Nodes:10
>  Node-Id Node-State Node-Http-Address Number-of-Running-Containers
>  rb0106.local:44160 RUNNING rb0106.local:8042 0
>  rb0105.local:32832 RUNNING rb0105.local:8042 0
>  rb0101.local:42627 RUNNING rb0101.local:8042 0
>  rb0108.local:38209 RUNNING rb0108.local:8042 0
>  rb0107.local:34306 RUNNING rb0107.local:8042 0
>  rb0102.local:43063 RUNNING rb0102.local:8042 0
>  rb0103.local:42374 RUNNING rb0103.local:8042 0
>  rb0109.local:37455 RUNNING rb0109.local:8042 0
>  rb0110.local:36690 RUNNING rb0110.local:8042 0
>  rb0104.local:33268 RUNNING rb0104.local:8042 0
>  [hadoop@rb01rm01 logs]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
>  18/03/08 09:21:01 INFO client.RMProxy: Connecting to ResourceManager at rb01rm01.local/192.168.1.100:8032
>  18/03/08 09:21:01 INFO client.AHSProxy: Connecting to Application History server at rb01rm01.local/192.168.1.100:10200
>  Total Nodes:0
>  Node-Id Node-State Node-Http-Address Number-of-Running-Containers
>  [hadoop@rb01rm01 logs]$
> 3. The ResourceManager, however, does not list node rb0101.local as SHUTDOWN when specifically requesting the list of nodes in the SHUTDOWN state:
> [hadoop@rb01rm01 bin]$ /opt/hadoop/bin/yarn node -list -states SHUTDOWN
>  18/03/08 08:28:23 INFO client.RMProxy: Connecting to ResourceManager at rb01rm01.local/v.x.y.z:8032
>  18/03/08 08:28:24 INFO client.AHSProxy: Connecting to Application History server at rb01rm01.local/v.x.y.z:10200
>  Total Nodes:0
>  Node-Id Node-State Node-Http-Address Number-of-Running-Containers
>  [hadoop@rb01rm01 bin]$


