Posted to yarn-issues@hadoop.apache.org by "NedaMaleki (JIRA)" <ji...@apache.org> on 2019/05/05 08:04:01 UTC

[jira] [Commented] (YARN-1021) Yarn Scheduler Load Simulator

    [ https://issues.apache.org/jira/browse/YARN-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833259#comment-16833259 ] 

NedaMaleki commented on YARN-1021:
----------------------------------

*Dear Wei Yan,*

*I use Hadoop 2.4.1. When I try to run SLS, I run into the same problem as YukunTsang:*

19/05/05 11:54:45 INFO capacity.CapacityScheduler: Added node a2116.smile.com:3 clusterResource: <memory:40960, vCores:40>
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
    at org.apache.hadoop.yarn.sls.SLSRunner.startAMFromRumenTraces(SLSRunner.java:394)
    at org.apache.hadoop.yarn.sls.SLSRunner.startAM(SLSRunner.java:246)
    at org.apache.hadoop.yarn.sls.SLSRunner.start(SLSRunner.java:141)
    at org.apache.hadoop.yarn.sls.SLSRunner.main(SLSRunner.java:524)
Caused by: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:123)
    ... 4 more

*After waiting a few minutes, I got the following messages, and then nothing more :(*

19/05/05 12:06:03 INFO util.AbstractLivelinessMonitor: Expired:a2115.smile.com:0 Timed out after 600 secs
19/05/05 12:06:03 INFO util.AbstractLivelinessMonitor: Expired:a2118.smile.com:1 Timed out after 600 secs
19/05/05 12:06:03 INFO util.AbstractLivelinessMonitor: Expired:a2117.smile.com:2 Timed out after 600 secs
19/05/05 12:06:03 INFO util.AbstractLivelinessMonitor: Expired:a2116.smile.com:3 Timed out after 600 secs
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: Deactivating Node a2115.smile.com:0 as it is now LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: a2115.smile.com:0 Node Transitioned from RUNNING to LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: Deactivating Node a2118.smile.com:1 as it is now LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: a2118.smile.com:1 Node Transitioned from RUNNING to LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: Deactivating Node a2117.smile.com:2 as it is now LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: a2117.smile.com:2 Node Transitioned from RUNNING to LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: Deactivating Node a2116.smile.com:3 as it is now LOST
19/05/05 12:06:03 INFO rmnode.RMNodeImpl: a2116.smile.com:3 Node Transitioned from RUNNING to LOST
19/05/05 12:06:03 INFO capacity.CapacityScheduler: Removed node a2115.smile.com:0 clusterResource: <memory:30720, vCores:30>
19/05/05 12:06:03 INFO capacity.CapacityScheduler: Removed node a2118.smile.com:1 clusterResource: <memory:20480, vCores:20>
19/05/05 12:06:03 INFO capacity.CapacityScheduler: Removed node a2117.smile.com:2 clusterResource: <memory:10240, vCores:10>
19/05/05 12:06:03 INFO capacity.CapacityScheduler: Removed node a2116.smile.com:3 clusterResource: <memory:0, vCores:0>

*I noticed that the exception is thrown right after clusterResource reaches <memory:40960, vCores:40>, and I do not know why.*
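
One reading of the stack trace above: the NullPointerException comes from ConcurrentHashMap.get, called inside ReflectionUtils.newInstance, and ConcurrentHashMap rejects null keys. That is consistent with a null class being passed to newInstance, for example if no simulator class is registered for the job type. The sketch below only reproduces that failure mode in isolation. The names AM_CLASSES, CACHE, and lookupThrowsNpe are hypothetical and are not taken from the SLS or Hadoop source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NullKeyLookupDemo {
    // Hypothetical registry from job type to simulator class (illustrative only).
    static final Map<String, Class<?>> AM_CLASSES = new HashMap<>();

    // Stand-in for a cache keyed by class and backed by ConcurrentHashMap.
    static final Map<Class<?>, String> CACHE = new ConcurrentHashMap<>();

    // Returns true if the cache lookup for the given job type throws an NPE.
    static boolean lookupThrowsNpe(String jobType) {
        // HashMap.get returns null when the job type was never registered.
        Class<?> cls = AM_CLASSES.get(jobType);
        try {
            // ConcurrentHashMap does not permit null keys, so get(null) throws.
            CACHE.get(cls);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupThrowsNpe("mapreduce")); // no class registered yet
        AM_CLASSES.put("mapreduce", Object.class);
        System.out.println(lookupThrowsNpe("mapreduce")); // class registered
    }
}
```

If this reading is right, checking which class the runner resolves for the job type before it is instantiated would be a reasonable first debugging step.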

 *1) I look forward to hearing from you, as I am stuck here!*

*2) My second question: where can I extend SLS, i.e. where should I write my scheduler code in SLS, run it, and get results? (I need to simulate my scheduler and then compare it with other schedulers such as FIFO, Fair, and Capacity.)*

*Thanks a lot,*

 *Neda*

> Yarn Scheduler Load Simulator
> -----------------------------
>
>                 Key: YARN-1021
>                 URL: https://issues.apache.org/jira/browse/YARN-1021
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: scheduler
>            Reporter: Wei Yan
>            Assignee: Wei Yan
>            Priority: Major
>             Fix For: 2.3.0
>
>         Attachments: YARN-1021-demo.tar.gz, YARN-1021-images.tar.gz, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.patch, YARN-1021.pdf
>
>
> The Yarn Scheduler is a fertile area of interest with different implementations, e.g., the Fifo, Capacity and Fair schedulers. Meanwhile, several optimizations have also been made to improve scheduler performance for different scenarios and workloads. Each scheduler algorithm has its own set of features and drives scheduling decisions by many factors, such as fairness, capacity guarantee, resource availability, etc. It is very important to evaluate a scheduler algorithm thoroughly before deploying it in a production cluster. Unfortunately, evaluating a scheduling algorithm is currently non-trivial. Evaluating in a real cluster is always costly in time and money, and it is also very hard to find a large-enough cluster. Hence, a simulator that can predict how well a scheduler algorithm works for a specific workload would be quite useful.
> We want to build a Scheduler Load Simulator to simulate large-scale Yarn clusters and application loads on a single machine. This would be invaluable in furthering Yarn by providing a tool for researchers and developers to prototype new scheduler features and predict their behavior and performance with a reasonable amount of confidence, thereby aiding rapid innovation.
> The simulator will exercise the real Yarn ResourceManager, removing the network factor by simulating NodeManagers and ApplicationMasters: it handles and dispatches NM/AM heartbeat events from within the same JVM.
> To keep track of scheduler behavior and performance, a scheduler wrapper will wrap the real scheduler.
> The simulator will produce real-time metrics while executing, including:
> * Resource usage for the whole cluster and for each queue, which can be used to configure cluster and queue capacity.
> * The detailed application execution trace (recorded in relation to simulated time), which can be analyzed to understand/validate the scheduler behavior (individual job turnaround time, throughput, fairness, capacity guarantee, etc).
> * Several key metrics of the scheduler algorithm, such as the time cost of each scheduler operation (allocate, handle, etc), which can be used by Hadoop developers to find hot spots in the code and scalability limits.
> The simulator will provide real-time charts showing the behavior of the scheduler and its performance.
> A short demo is available at http://www.youtube.com/watch?v=6thLi8q0qLE, showing how to use the simulator to simulate the Fair Scheduler and the Capacity Scheduler.
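
The scheduler-wrapper idea in the description (wrap the real scheduler and record the time cost of operations such as allocate and handle) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SLS implementation: the Scheduler interface, FifoLikeScheduler, and TimedScheduler are hypothetical names, and the interface is a toy stand-in rather than the YARN ResourceScheduler API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TimedSchedulerSketch {
    /** Hypothetical minimal scheduler interface; NOT the YARN ResourceScheduler API. */
    interface Scheduler {
        int allocate(String appId, int requestedContainers);
        void handle(String event);
    }

    /** Toy scheduler that grants containers first come, first served until none remain. */
    static class FifoLikeScheduler implements Scheduler {
        private int free;

        FifoLikeScheduler(int capacity) {
            this.free = capacity;
        }

        @Override
        public int allocate(String appId, int requestedContainers) {
            int granted = Math.min(requestedContainers, free);
            free -= granted;
            return granted;
        }

        @Override
        public void handle(String event) {
            // A real scheduler would react to node and application events here.
        }
    }

    /** Wrapper that delegates every call and records call counts and elapsed time. */
    static class TimedScheduler implements Scheduler {
        private final Scheduler delegate;
        final Map<String, Long> calls = new LinkedHashMap<>();
        final Map<String, Long> nanos = new LinkedHashMap<>();

        TimedScheduler(Scheduler delegate) {
            this.delegate = delegate;
        }

        private void record(String op, long startNanos) {
            calls.merge(op, 1L, Long::sum);
            nanos.merge(op, System.nanoTime() - startNanos, Long::sum);
        }

        @Override
        public int allocate(String appId, int requestedContainers) {
            long start = System.nanoTime();
            try {
                return delegate.allocate(appId, requestedContainers);
            } finally {
                record("allocate", start);
            }
        }

        @Override
        public void handle(String event) {
            long start = System.nanoTime();
            try {
                delegate.handle(event);
            } finally {
                record("handle", start);
            }
        }
    }

    public static void main(String[] args) {
        TimedScheduler scheduler = new TimedScheduler(new FifoLikeScheduler(10));
        scheduler.allocate("app1", 4);
        scheduler.allocate("app2", 8);
        scheduler.handle("NODE_UPDATE");
        System.out.println("calls: " + scheduler.calls);
    }
}
```

Because the wrapper implements the same interface as the scheduler it wraps, any implementation can be swapped in behind it and measured the same way, which is the property that makes comparing schedulers on one workload straightforward.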


