Posted to yarn-issues@hadoop.apache.org by "Szilard Nemeth (Jira)" <ji...@apache.org> on 2020/12/22 15:54:00 UTC

[jira] [Commented] (YARN-10427) Duplicate Job IDs in SLS output

    [ https://issues.apache.org/jira/browse/YARN-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253589#comment-17253589 ] 

Szilard Nemeth commented on YARN-10427:
---------------------------------------

Hi [~werd.up],
 Thanks for reporting this issue, and congratulations on your first reported Hadoop YARN jira.
{quote}In the process of attempting to verify and validate the SLS output, I've encountered a number of issues including runtime exceptions and bad output.
{quote}
I read through your observations and spent some time playing around with SLS.

If you encountered other issues, please file separate jiras for them if you have some time.

As running SLS involves some repetitive tasks (uploading configs to the remote machine, launching SLS, saving the resulting logs, and so on), I created some scripts in my public GitHub repo here: [https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427]

Let me briefly summarize what these scripts do:
 1. [config dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/config]: This is the exact same configuration file set that you attached to this jira, with one exception: the log4j.properties file, which turns on DEBUG logging for SLS.

2. [upstream-patches dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches]: This directory contains the logging patch that helped me see the issues more clearly.
 My code changes are also pushed to my Hadoop fork: [https://github.com/szilard-nemeth/hadoop/tree/YARN-10427-investigation]

3. [scripts dir|https://github.com/szilard-nemeth/linux-env/tree/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts]: This is the directory that contains all my scripts to build Hadoop + launch SLS and save produced logs to the local machine.
 As I have been working on a remote cluster, there's a script called [setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh] that contains some configuration values for the remote cluster plus some local directories. If you want to use the scripts, all you need to do is replace the configs in this file according to your environment.

3.1 [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh]: This is the script that builds Hadoop according to the environment variables and launches the SLS suite on the remote cluster.

3.2 [start-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/start-sls.sh]: This is the most important script as this will be executed on the remote machine. 
 I think the script itself is straightforward enough, but let me briefly list what it does:
 - This script assumes that the Hadoop dist package is copied to the remote machine (this was done by [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh])
 - Cleans up all Hadoop-related directories and extracts the Hadoop dist tar.gz
 - Copies the config to Hadoop's config dirs so SLS will use these particular configs
 - Launches SLS by starting slsrun.sh with the appropriate CLI switches
 - Greps for some useful data in the resulting SLS log file.

3.3 [launch-sls.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/launch-sls.sh]: This script is executed by [build-and-launch.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/build-and-launch-sls.sh] as its last step. Once start-sls.sh has finished, the [save-latest-sls-logs.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/save-latest-sls-logs.sh] script is started. As the name implies, it saves the latest SLS log dir and SCPs it to the local machine. The target directory on the local machine is determined by the config ([setup-vars-upstream.sh|https://github.com/szilard-nemeth/linux-env/blob/ff84652b34bc23c1f88766f781f6648365becde5/workplace-specific/cloudera/investigations/YARN-10427/scripts/setup-vars-upstream.sh]).

*The latest logs and grepped logs for the SLS run are saved to my repo [here.|https://github.com/szilard-nemeth/linux-env/tree/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513]*
h2. What causes the duplicate Job IDs

1. The jobruntime.csv file is written by the SchedulerMetrics class; you can see the initialization part [here|https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java#L180-L186].

2. The jobruntime records (lines of the CSV file) are written by the method [SchedulerMetrics#addAMRuntime|https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java#L661-L674]. Checking the call hierarchy of this method reveals the reason for the duplicate application IDs.
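To make the mechanism concrete, here is a toy sketch (my own simplified stand-in, not the real SchedulerMetrics code): since each call unconditionally appends one CSV record, two lastStep calls for the same AM produce two rows with the same application ID.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Toy stand-in for SchedulerMetrics#addAMRuntime: every invocation
// unconditionally appends one CSV record, so calling it twice for the
// same AM yields a duplicated application ID in jobruntime.csv.
class AddAmRuntimeSketch {
    static StringWriter csv = new StringWriter();

    static void addAMRuntime(String appId, long traceStart, long traceEnd,
                             long simStart, long simEnd) {
        new PrintWriter(csv, true).printf("%s,%d,%d,%d,%d%n",
                appId, traceStart, traceEnd, simStart, simEnd);
    }

    public static void main(String[] args) {
        // lastStep runs twice for the same AM, so addAMRuntime runs twice:
        addAMRuntime("application_1608638719822_0001", 0, 100, 5, 95);
        addAMRuntime("application_1608638719822_0001", 0, 100, 5, 96);
        System.out.print(csv); // two rows, same application ID
    }
}
```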

*2.1 Call hierarchy #1 (From bottom to top):*
{code:java}
org.apache.hadoop.yarn.sls.scheduler.SchedulerMetrics#addAMRuntime
  org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep
    org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator#lastStep
      org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator#processResponseQueue{code}
*2.2 Call hierarchy #2 (From bottom to top):*
{code:java}
org.apache.hadoop.yarn.sls.scheduler.SchedulerMetrics#addAMRuntime
  org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep
    org.apache.hadoop.yarn.sls.scheduler.TaskRunner.Task#run 
{code}
3. These duplicate calls of MRAMSimulator#lastStep can also be confirmed from the logs: [apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/0d41e4dbda5e3a22105c4fe27f540ae8004857fe/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/apps-shuttingdown.log]
 In this logfile, it's clearly visible that 9 apps (application_1608638719822_0001 - application_1608638719822_0009) are "shutting down" twice.
 This is because MRAMSimulator#lastStep is called twice.
 As MRAMSimulator#lastStep calls org.apache.hadoop.yarn.sls.appmaster.AMSimulator#lastStep (super method), I added some logging that prints the stacktrace of lastStep method calls: [AMSimulator#lastStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L223-L225].

Let's take application_1608638719822_0001 as an example with this file: [laststep-calls-for-app0001.log|https://github.com/szilard-nemeth/linux-env/blob/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/laststep-calls-for-app0001.log]

4. Let's check the two stacktraces:

*4.1 Stacktrace #1: Call to lastStep from MRAMSimulator#processResponseQueue, when all mappers/reducers are finished:*
{code:java}
at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.processResponseQueue(MRAMSimulator.java:195)
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
	at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:101)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
[TaskRunner$Task.run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L101] calls AMSimulator#middleStep.
 Then, in [MRAMSimulator.processResponseQueue|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java#L194-L196], there's a code snippet that checks for completed mappers and reducers.
 If the number of finished mappers reaches the total mapper count (and likewise for reducers), lastStep is called.
{code:java}
if (mapFinished >= mapTotal && reduceFinished >= reduceTotal) {
  lastStep();
}
{code}
*4.2 Stacktrace #2: Call to lastStep from [TaskRunner$Task.run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L89-L113]*
{code:java}
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
	at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:106)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
According to my code inspection, all NMs and AMs are scheduled with this TaskRunner from SLSRunner.
 The call hierarchy of an AM launch is the following (from bottom to top):

{code:java}
TaskRunner.schedule(Task) (org.apache.hadoop.yarn.sls.scheduler)
  SLSRunner.runNewAM(String, String, String, String, long, long, List<ContainerSimulator>, ...) (org.apache.hadoop.yarn.sls)
    SLSRunner.runNewAM(String, String, String, String, long, long, List<ContainerSimulator>, ...) (org.apache.hadoop.yarn.sls)
      SLSRunner.createAMForJob(Map) (org.apache.hadoop.yarn.sls)
        SLSRunner.startAMFromSLSTrace(String) (org.apache.hadoop.yarn.sls)
          SLSRunner.startAM() (org.apache.hadoop.yarn.sls)
            SLSRunner.start() (org.apache.hadoop.yarn.sls)
              SLSRunner.run(String[]) (org.apache.hadoop.yarn.sls){code}
The AM is implemented by the [AMSimulator|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java] class, which extends TaskRunner.Task, which in turn implements the Runnable interface; all the interesting things happen in [org.apache.hadoop.yarn.sls.scheduler.TaskRunner.Task#run|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java#L89-L113].
 Initially, the field _nextRun_ is equal to _startTime_, so the firstStep method is invoked.
 On subsequent calls of run, while _nextRun_ < _endTime_, middleStep is executed.
 The field _nextRun_ is always incremented by the value of _repeatInterval_ (which is 1000 ms with the default config).
 This means that every AMSimulator task is scheduled once per second.
 Once _nextRun_ exceeds _endTime_, lastStep is called.
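The timing logic above can be sketched as follows (a minimal model with paraphrased names, not the actual Hadoop source; the real TaskRunner.Task#run also handles thread-pool scheduling and exceptions):

```java
// Minimal model of the step scheduling described above (paraphrased
// names, not the actual Hadoop source). Each call to run() represents
// one scheduled tick, repeatInterval ms apart.
class TaskRunSketch {
    long startTime = 0;
    long endTime = 5000;
    long repeatInterval = 1000; // 1000 ms with the default config
    long nextRun = startTime;
    int firstSteps = 0, middleSteps = 0, lastSteps = 0;

    void run() {
        if (nextRun == startTime) {
            firstSteps++;           // firstStep()
        } else if (nextRun < endTime) {
            middleSteps++;          // middleStep() - may itself call lastStep()!
        } else {
            lastSteps++;            // lastStep()
        }
        nextRun += repeatInterval;  // task is rescheduled repeatInterval ms later
    }

    public static void main(String[] args) {
        TaskRunSketch t = new TaskRunSketch();
        for (int i = 0; i < 6; i++) {
            t.run();
        }
        // 1 firstStep, 4 middleSteps, 1 lastStep for a 5-second task
        System.out.println(t.firstSteps + " " + t.middleSteps + " " + t.lastSteps);
    }
}
```

The comment in the middleStep branch marks the root of the bug: middleStep can reach lastStep through MRAMSimulator#processResponseQueue, and the scheduler calls lastStep again once nextRun passes endTime.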
h2. Conclusion for duplicate Job IDs

These two calls to lastStep are the main reason for the duplicate application IDs in the jobruntime.csv file.
 It's not obvious to me why this lastStep method is invoked both through [AMSimulator#middleStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L209] (ultimately through [AMSimulator#processResponseQueue|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L212]) and from the main loop of TaskRunner$Task.
*I suppose this method should be invoked only once per AM!*

What is even more interesting is that 9 out of 10 apps had this method called twice, according to this log file: [apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/0d41e4dbda5e3a22105c4fe27f540ae8004857fe/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/apps-shuttingdown.log].
 For the last application, however, it is called only once:
{code:java}
2020-12-22 04:09:47,892 INFO appmaster.AMSimulator: Application application_1608638719822_0010 is shutting down. lastStep Stacktrace
{code}
All I can see is that the only call to lastStep for app 0010 is this:
 (This is from [log file|https://raw.githubusercontent.com/szilard-nemeth/linux-env/master/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/output.log])
{code:java}
2020-12-22 04:09:47,892 INFO appmaster.AMSimulator: Application application_1608638719822_0010 is shutting down. lastStep Stacktrace
java.lang.Exception
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.lastStep(AMSimulator.java:224)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.lastStep(MRAMSimulator.java:401)
	at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.processResponseQueue(MRAMSimulator.java:195)
	at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
	at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:101)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
_*This is the call from MRAMSimulator.processResponseQueue that verifies the number of completed mappers/reducers.*_
 _*The other call site, the timestamp check in TaskRunner$Task.run, is never reached, meaning that the last application never reaches its intended running time.*_
 _*This could count as another bug, but unfortunately I wasn't able to find out why this anomaly happens.*_
h2. Other observations

If I grep for any container ID belonging to any of the 9 applications that had duplicate Job IDs in the jobruntime.csv file, each of the apps has a log record like this in the output.log:
{code:java}
2020-12-22 04:07:11,980 INFO scheduler.AbstractYarnScheduler: Container container_1608638719822_0001_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
{code}
[See an example here|https://github.com/szilard-nemeth/linux-env/blob/96ed3d8af9f4677866652bb57153713b29f24a98/workplace-specific/cloudera/investigations/YARN-10427/latest-logs/slsrun-out-20201222_040513/grepped/container_1608638719822_0001_01_000001.log#L32]
 I think this is also happening because of the duplicate call to the lastStep method.
h2. Possible fix for duplicate Job IDs

The task is to prevent lastStep from being called twice.
 Without fully understanding the reason for the two calls above, or the potential side effects of removing either of them, let's check what lastStep does.
 The implementation of lastStep for MRAMSimulator delegates to the superclass: [AMSimulator#lastStep|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L222-L273].
 *There are several things happening in this method:*
 - App is unregistered / untracked.

 - If the amContainer is not null, the NM of the AM will be notified and the AM container will be marked as completed [here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L231-L238].

 - The AM is unregistered from the RM [here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L246-L263].

 - The finish time of the AM is set, this is the only write access of this field: [here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L265].

 - The job's runtime information will be persisted to the jobruntime.csv file [here|https://github.com/szilard-nemeth/hadoop/blob/10d9d9ff3446583b3b2b6e4518ad0c3ea335da48/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java#L266-L272].

*I think all of these actions must be prevented from running more than once!*

As lastStep updates only one field, a quick and dirty solution that avoids introducing a new boolean flag to track whether lastStep was called is to check whether the _org.apache.hadoop.yarn.sls.appmaster.AMSimulator#simulateFinishTimeMS_ field has been modified, i.e. is greater than zero (zero being the default value of long fields). As the only writer of this field is the single write in the lastStep method, it's safe to check it: if it is greater than zero, lastStep was called before.
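A sketch of that guard (a simplified, hypothetical class for illustration; the real change lives in AMSimulator#lastStep and is in the attached patch):

```java
// Simplified sketch of the proposed guard: because simulateFinishTimeMS
// is written only inside lastStep, a non-zero value means lastStep has
// already run, so a second invocation can return early.
class LastStepGuardSketch {
    long simulateFinishTimeMS = 0; // long fields default to 0
    int csvRecordsWritten = 0;

    void lastStep(long nowMs) {
        if (simulateFinishTimeMS > 0) {
            return; // lastStep already ran; skip unregistering and CSV writing
        }
        simulateFinishTimeMS = nowMs;
        csvRecordsWritten++;  // stands in for writing the jobruntime.csv record
    }

    public static void main(String[] args) {
        LastStepGuardSketch am = new LastStepGuardSketch();
        am.lastStep(1000L); // call #1: from MRAMSimulator#processResponseQueue
        am.lastStep(2000L); // call #2: from TaskRunner$Task.run - ignored
        System.out.println(am.csvRecordsWritten); // 1
    }
}
```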
h2. Test run with the fix

The fix patch is available [here|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/upstream-patches/0002-YARN-10427-Prevent-second-call-of-AMSimulator-lastSt.patch].
 It is also uploaded as an attachment to this jira as a candidate for commit, as I think it's a proper fix.
 The logs of the "fixed run" can be found here: [https://github.com/szilard-nemeth/linux-env/tree/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs]

1. The shutting down messages for applications look way better: there are only 10 messages for 10 apps, which is correct: [apps-shuttingdown.log|https://github.com/szilard-nemeth/linux-env/blob/master/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/grepped/apps-shuttingdown.log]

2. The [jobruntime.csv|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/jobruntime.csv] file also looks good. There's one entry per application now.

3. In the [output.log|https://github.com/szilard-nemeth/linux-env/blob/9bd94311a900b79764d2ee26db16aed312a7fff7/workplace-specific/cloudera/investigations/YARN-10427/fixed-logs/output.log] file, there are still weird messages when the AM container is finished, for all the apps:
{code:java}
[root@snemeth-fips2-1 slsrun-out-20201222_063242]# grep "but corresponding RMContainer doesn't exist" output.log 
2020-12-22 06:34:40,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0002_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:34:41,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0001_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:05,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0003_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:10,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0005_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:35:30,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0006_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:04,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0009_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:04,373 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0008_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:20,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0004_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
2020-12-22 06:36:26,315 INFO scheduler.AbstractYarnScheduler: Container container_1608647568797_0007_01_000001 completed with event FINISHED, but corresponding RMContainer doesn't exist.
{code}
So, contrary to my expectations, this is not caused by the double call of lastStep.

> Duplicate Job IDs in SLS output
> -------------------------------
>
>                 Key: YARN-10427
>                 URL: https://issues.apache.org/jira/browse/YARN-10427
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler-load-simulator
>    Affects Versions: 3.0.0, 3.3.0, 3.2.1, 3.4.0
>         Environment: I ran the attached inputs on my MacBook Pro, using Hadoop compiled from the latest trunk (as of commit 139a43e98e). I also tested against 3.2.1 and 3.3.0 release branches.
>  
>            Reporter: Drew Merrill
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: fair-scheduler.xml, inputsls.json, jobruntime.csv, jobruntime.csv, mapred-site.xml, sls-runner.xml, yarn-site.xml
>
>
> Hello, I'm hoping someone can help me resolve or understand some issues I've been having with the YARN Scheduler Load Simulator (SLS). I've been experimenting with SLS for several months now at work as we're trying to build a simulation model to characterize our enterprise Hadoop infrastructure for purposes of future capacity planning. In the process of attempting to verify and validate the SLS output, I've encountered a number of issues including runtime exceptions and bad output. The focus of this issue is the bad output. In all my simulation runs, the jobruntime.csv output seems to have one or more of the following problems: no output, duplicate job ids, and/or missing job ids.
>  
> Because of where I work, I'm unable to provide the exact inputs I typically use, but I'm able to reproduce the problem of the duplicate Job IDS using some simplified inputs and configuration files, which I've attached, along with the output I obtained.
>  
> The command I used to run the simulation:
> {{./runsls.sh --tracetype=SLS --tracelocation=./inputsls.json --output-dir=sls-run-1 --print-simulation --track-jobs=job_1,job_2,job_3,job_4,job_5,job_6,job_7,job_8,job_9,job_10}}
>  
> Can anyone help me understand what would cause the duplicate Job IDs in the output? Is this a bug in Hadoop or a problem with my inputs? Thanks in advance.
>  
> PS: This is my first issue I've ever opened so please be kind if I've missed something or am not understanding something obvious about the way Hadoop works. I'll gladly follow-up with more info as requested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
