Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2015/06/09 18:39:00 UTC

[jira] [Commented] (YARN-3788) Application Master and Task Tracker timeouts are applied incorrectly

    [ https://issues.apache.org/jira/browse/YARN-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579188#comment-14579188 ] 

Rohith commented on YARN-3788:
------------------------------

This is a MapReduce project issue/query; moving it to MAPREDUCE for further discussion.

> Application Master and Task Tracker timeouts are applied incorrectly
> --------------------------------------------------------------------
>
>                 Key: YARN-3788
>                 URL: https://issues.apache.org/jira/browse/YARN-3788
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>
> I am running a streaming job which requires a big (~50 GB) data file to run (the file is attached via hadoop jar <...> -file BigFile.dat).
> Most likely this command will fail as follows (note that the error message is rather meaningless):
> 2015-05-27 15:55:00,754 WARN  [main] streaming.StreamJob (StreamJob.java:parseArgv(291)) - -file option is deprecated, please use generic option -files instead.
> packageJobJar: [/ssd/mt/lm/en_reorder.ylm, mapper.py, /tmp/hadoop-mitya/hadoop-unjar3778165585140840383/] [] /var/tmp/streamjob633547925483233845.jar tmpDir=null
> 2015-05-27 19:46:22,942 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 19:46:23,733 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager at nezabudka1-00.yandex.ru/5.255.231.129:8032
> 2015-05-27 20:13:37,231 INFO  [main] mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
> 2015-05-27 20:13:38,110 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
> 2015-05-27 20:13:38,136 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2015-05-27 20:13:38,390 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_1431704916575_2531
> 2015-05-27 20:13:38,689 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(204)) - Submitted application application_1431704916575_2531
> 2015-05-27 20:13:38,743 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://nezabudka1-00.yandex.ru:8088/proxy/application_1431704916575_2531/
> 2015-05-27 20:13:38,746 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_1431704916575_2531
> 2015-05-27 21:04:12,353 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_1431704916575_2531 running in uber mode : false
> 2015-05-27 21:04:12,356 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
> 2015-05-27 21:04:12,374 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1375)) - Job job_1431704916575_2531 failed with state FAILED due to: Application application_1431704916575_2531 failed 2 times due to ApplicationMaster for attempt appattempt_1431704916575_2531_000002 timed out. Failing the application.
> 2015-05-27 21:04:12,473 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 0
> 2015-05-27 21:04:12,474 ERROR [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1019)) - Job not Successful!
> Streaming Command Failed!
> This is because the yarn.am.liveness-monitor.expiry-interval-ms timeout (default 600 seconds) expires before the large data file is transferred.
> As the next step I increased yarn.am.liveness-monitor.expiry-interval-ms. After that the application initializes successfully and tasks are spawned.
> But then I encountered another error: the default 600-second mapreduce.task.timeout expires before the tasks are initialized, and the tasks fail.
> The error message "Task attempt_XXX failed to report status for 600 seconds" is also misleading: this timeout is supposed to kill non-responsive (stuck) tasks, but here it fires because the auxiliary data files are copied slowly.
> So I had to increase mapreduce.task.timeout as well, and only after that did my job succeed.
> At the very least, the error messages need to be tweaked to indicate that the Application (or Task) is failing because the auxiliary files were not copied within that time, not just a generic "timeout expired".
> A better solution would be to not count the time spent on data file distribution against these timeouts.
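As a rough illustration of why the 600-second defaults are too small for this job: the ~50 GB figure comes from the report, but the transfer throughput below is a hypothetical assumption, so this is only a back-of-the-envelope sketch, not a measurement from the cluster in question.

```python
# Back-of-the-envelope check: at an assumed localization throughput
# (hypothetical; the report does not state the actual rate), how long
# does distributing the ~50 GB side file take, and does that exceed
# the 600-second defaults of the two timeouts discussed above?

FILE_SIZE_GB = 50           # size of the -file payload from the report
THROUGHPUT_MB_PER_S = 50    # assumed transfer rate (hypothetical)
DEFAULT_TIMEOUT_S = 600     # default for both timeouts discussed above

transfer_seconds = FILE_SIZE_GB * 1024 / THROUGHPUT_MB_PER_S

print(f"estimated transfer time: {transfer_seconds:.0f} s")
print(f"exceeds default {DEFAULT_TIMEOUT_S} s timeout: "
      f"{transfer_seconds > DEFAULT_TIMEOUT_S}")
```

With an estimate like this in hand, the workaround the reporter describes is to raise both timeouts, e.g. passing -D mapreduce.task.timeout=1800000 to the streaming job; note that yarn.am.liveness-monitor.expiry-interval-ms is a ResourceManager-side setting (yarn-site.xml), so it generally has to be raised cluster-wide rather than per job. The chosen value of 1800000 ms (30 minutes) is only an example, not a recommendation.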



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)