Posted to yarn-dev@hadoop.apache.org by "Dmitry Sivachenko (JIRA)" <ji...@apache.org> on 2015/06/09 17:33:00 UTC

[jira] [Created] (YARN-3788) Application Master and Task Tracker timeouts are applied incorrectly

Dmitry Sivachenko created YARN-3788:
---------------------------------------

             Summary: Application Master and Task Tracker timeouts are applied incorrectly
                 Key: YARN-3788
                 URL: https://issues.apache.org/jira/browse/YARN-3788
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.4.1
            Reporter: Dmitry Sivachenko


I am running a streaming job that requires a large (~50 GB) data file to run (the file is attached via hadoop jar <...> -file BigFile.dat).

This command will most likely fail as follows (note that the error message is rather uninformative):
2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use generic option -files instead.
packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] /var/tmp/streamjob633547925483233845.jar tmpDir=null
2015-05-27 19:46:22,942 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
2015-05-27 19:46:23,733 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_1431704916575_2531
2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(204)) - Submitted application application_1431704916575_2531
2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in uber mode : false
2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with state FAILED due to: Application application_1431704916575_2531 failed 2 times due to ApplicationMaster for attempt appattempt_1431704916575_2531_000002 timed out. Failing the application.
2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 0
2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
Streaming Command Failed!


This happens because the yarn.am.liveness-monitor.expiry-interval-ms timeout (600,000 ms, i.e. 600 seconds, by default) expires before the large data file has been transferred.

As the next step, I increase yarn.am.liveness-monitor.expiry-interval-ms. After that, the application initializes successfully and tasks are spawned.
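For reference, the increased AM liveness timeout can be set in yarn-site.xml; the 3600000 ms value below is only an illustrative choice (it should simply exceed the expected file-transfer time), not a recommendation:

```xml
<!-- yarn-site.xml: how long the ResourceManager waits for an
     ApplicationMaster heartbeat before declaring the attempt
     dead (default: 600000 ms). -->
<property>
  <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
  <value>3600000</value> <!-- 1 hour; illustrative value -->
</property>
```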

But then I encounter another error: the default 600-second mapreduce.task.timeout expires before the tasks are initialized, and the tasks fail.

The error message "Task attempt_XXX failed to report status for 600 seconds" is also misleading: this timeout is meant to kill non-responsive (stuck) tasks, but here it fires because the auxiliary data files are copied slowly.

So I need to increase mapreduce.task.timeout as well, and only after that does my job succeed.
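Similarly, the task timeout can be raised cluster-wide in mapred-site.xml (or per job via the generic -D option); again, the value below is only illustrative:

```xml
<!-- mapred-site.xml: how long a task may go without reporting
     progress before the framework kills it (default: 600000 ms). -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>3600000</value> <!-- 1 hour; illustrative value -->
</property>
```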

At the very least, the error messages should be adjusted to indicate that the Application (or Task) is failing because the auxiliary files were not copied in time, rather than reporting a generic "timeout expired".

A better solution would be to exclude the time spent distributing data files from these timeouts altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)