Posted to yarn-issues@hadoop.apache.org by "Ahmed Hussein (Jira)" <ji...@apache.org> on 2020/12/11 05:29:00 UTC

[jira] [Comment Edited] (YARN-10040) DistributedShell test failure on X86 and ARM

    [ https://issues.apache.org/jira/browse/YARN-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247564#comment-17247564 ] 

Ahmed Hussein edited comment on YARN-10040 at 12/11/20, 5:28 AM:
-----------------------------------------------------------------

{quote}Abhishek Modi any pointers about this? Is the code only broken or just the test. If the functionality itself has some issue we should consider reverting YARN-9697, else if this is only a test issue, we should wrap this up, if there isn't a fix available we can disable this test for time being. Let me know what is the actual situation. I can try help in whichever way possible.{quote}

[~abmodi] Would you mind please taking a look at the failures?




was (Author: ahussein):
On iOS, the {{TestDistributedShell}} test does not run. I thought I would dump the error here anyway, because the NPE could be a hint as to what is broken in the implementation.


{code:bash}
2020-12-10 17:29:22,129 INFO  [IPC Server listener on 8048] ipc.Server (Server.java:run(1344)) - IPC Server listener on 8048: starting
2020-12-10 17:29:22,131 INFO  [Listener at localhost/8048] collectormanager.NMCollectorService (NMCollectorService.java:serviceStart(101)) - NMCollectorService started at localhost/127.0.0.1:8048
2020-12-10 17:29:22,131 INFO  [Listener at localhost/8048] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:serviceStart(267)) - Node ID assigned is : localhost:54943
2020-12-10 17:29:22,207 INFO  [Listener at localhost/8048] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(617)) - NodeManager from node localhost(cmPort: 54943 httpPort: 54946) registered with capability: <memory:4096, vCores:8>, assigned nodeId localhost:54943
2020-12-10 17:29:22,210 INFO  [Listener at localhost/8048] security.NMContainerTokenSecretManager (NMContainerTokenSecretManager.java:setMasterKey(143)) - Rolling master-key for container-tokens, got key with id -210390460
2020-12-10 17:29:22,210 INFO  [Listener at localhost/8048] security.NMTokenSecretManagerInNM (NMTokenSecretManagerInNM.java:setMasterKey(143)) - Rolling master-key for container-tokens, got key with id -1432443197
2020-12-10 17:29:22,210 INFO  [Listener at localhost/8048] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(486)) - Registered with ResourceManager as localhost:54943 with total resource of <memory:4096, vCores:8>
2020-12-10 17:29:22,212 INFO  [Listener at localhost/8048] delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:updateCurrentKey(367)) - Updating the current master key for generating delegation tokens
2020-12-10 17:29:22,212 INFO  [Thread[Thread-282,5,FailOnTimeoutGroup]] delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(701)) - Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
2020-12-10 17:29:22,212 INFO  [Thread[Thread-282,5,FailOnTimeoutGroup]] delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:updateCurrentKey(367)) - Updating the current master key for generating delegation tokens
2020-12-10 17:29:22,212 INFO  [RM Event dispatcher] rmnode.RMNodeImpl (RMNodeImpl.java:handle(774)) - localhost:54943 Node Transitioned from NEW to UNHEALTHY
2020-12-10 17:29:22,214 INFO  [org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService:Event Processor] distributed.NodeQueueLoadMonitor (NodeQueueLoadMonitor.java:removeNode(202)) - Node delete event for: localhost
2020-12-10 17:29:22,215 ERROR [SchedulerEventDispatcher:Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:removeNode(2127)) - Attempting to remove non-existent node localhost:54943
2020-12-10 17:29:22,215 ERROR [org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService:Event Processor] event.EventDispatcher (MarkerIgnoringBase.java:error(159)) - Error in handling event type NODE_REMOVED to the Event Dispatcher
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.distributed.NodeQueueLoadMonitor.removeFromNodeIdsByRack(NodeQueueLoadMonitor.java:405)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.distributed.NodeQueueLoadMonitor.removeNode(NodeQueueLoadMonitor.java:204)
	at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handle(OpportunisticContainerAllocatorAMService.java:399)
	at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.handle(OpportunisticContainerAllocatorAMService.java:94)
	at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:71)
	at java.lang.Thread.run(Thread.java:748)
2020-12-10 17:29:22,216 INFO  [org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService:Event Processor] event.EventDispatcher (EventDispatcher.java:run(84)) - Exiting, bbye..
2020-12-10 17:29:22,217 INFO  [Listener at localhost/8048] ipc.CallQueueManager (CallQueueManager.java:<init>(93)) - Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 1000, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, ipcBackoff: false.
2020-12-10 17:29:22,218 INFO  [Socket Reader #1 for port 0] ipc.Server (Server.java:run(1265)) - Starting Socket Reader #1 for port 0
2020-12-10 17:29:22,222 INFO  [Listener at localhost/54947] pb.RpcServerFactoryPBImpl (RpcServerFactoryPBImpl.java:createServer(174)) - Adding protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB to the server

{code}
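For anyone reading the trace: the NPE is thrown from {{removeFromNodeIdsByRack}} right after the scheduler logs "Attempting to remove non-existent node", which suggests the removal path may be reached for a node that was never indexed. The snippet below is only a minimal sketch of that general failure mode and of a null-guarded alternative. {{RackIndexSketch}}, its methods, and the rack name are hypothetical stand-ins, not the actual {{NodeQueueLoadMonitor}} code, and the real root cause may be different.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a rack-keyed node index; NOT the Hadoop class.
public class RackIndexSketch {
    // rack name -> IDs of the nodes registered on that rack
    private final Map<String, Set<String>> nodeIdsByRack = new HashMap<>();

    public void addNode(String rack, String nodeId) {
        nodeIdsByRack.computeIfAbsent(rack, r -> new HashSet<>()).add(nodeId);
    }

    // Unguarded removal: throws NullPointerException when the rack has no
    // entry, e.g. because the node was removed before it was ever added.
    public void removeUnguarded(String rack, String nodeId) {
        nodeIdsByRack.get(rack).remove(nodeId);
    }

    // Guarded removal: reports whether anything was removed instead of
    // throwing when the rack is null or was never indexed.
    public boolean removeGuarded(String rack, String nodeId) {
        if (rack == null) {
            return false;
        }
        Set<String> ids = nodeIdsByRack.get(rack);
        if (ids == null) {
            return false;
        }
        boolean removed = ids.remove(nodeId);
        if (ids.isEmpty()) {
            // Drop empty rack entries so the index does not grow unbounded.
            nodeIdsByRack.remove(rack);
        }
        return removed;
    }

    public static void main(String[] args) {
        RackIndexSketch index = new RackIndexSketch();
        try {
            // The node was never added, so the rack lookup yields null.
            index.removeUnguarded("/default-rack", "localhost:54943");
        } catch (NullPointerException e) {
            System.out.println("unguarded removal threw NPE");
        }
        System.out.println("guarded removal returned "
            + index.removeGuarded("/default-rack", "localhost:54943"));
    }
}
```

If the real cause is along these lines, a guard of this kind would keep the event dispatcher thread alive instead of letting it exit on the first {{NODE_REMOVED}} event for an unknown node.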

{quote}Abhishek Modi any pointers about this? Is the code only broken or just the test. If the functionality itself has some issue we should consider reverting YARN-9697, else if this is only a test issue, we should wrap this up, if there isn't a fix available we can disable this test for time being. Let me know what is the actual situation. I can try help in whichever way possible.{quote}

[~abmodi] Would you mind please taking a look at the failures?



> DistributedShell test failure on X86 and ARM
> --------------------------------------------
>
>                 Key: YARN-10040
>                 URL: https://issues.apache.org/jira/browse/YARN-10040
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>         Environment: X86/ARM
> OS: ubuntu1804
> Java 8
>            Reporter: zhao bo
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-10040.001.patch
>
>
> * org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers
> * org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
> Please see the Apache Jenkins Test result:
> [https://builds.apache.org/job/hadoop-multibranch/job/PR-1767/1/testReport/]
>  
> These 2 tests fail on both the X86 and ARM platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
