Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2017/12/11 17:41:00 UTC

[jira] [Assigned] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned MAPREDUCE-7022:
-------------------------------------

    Assignee: Johan Gustavsson

Thanks for the patch!  Comments on the approach:

This is going to have some issues compared to the FileSystem counter approach.  The directories being checked by the patch are _not_ specific to an attempt.  If the intent was to cover just the container working directory then that's not what this patch does.  It is checking the application directories and thus is checking the disk usage _across task attempts_ on a node's disk.  If too many task attempts run on a node then this threshold could be triggered, causing the entire job to fail, when not all of the data originated from a single task attempt.  This isn't necessarily desirable, since the task getting retried on another node could run just fine and the job would succeed.

It's pretty confusing to have both mapreduce.task.local-fs.limit.bytes and mapreduce.task.local-fs.write-limit.bytes.  Users are likely going to set both of these to the same value since they won't understand the difference, or they will only set one of them.  As mentioned above this isn't really a per-task thing as implemented in the patch, so maybe we should be using 'job' here instead of 'task'.  Also it's not clear from the description that this is per-disk and across tasks, rather than per-task and across disks as implemented by the original write-limit.

This is going to add a disk I/O dependency to every task heartbeat where the task attempt needs to touch every disk.  There will be cases where this feature will cause tasks to timeout/fail when they didn't before because this will repeatedly touch bad/slow disks.  I'm not sure if it makes more sense to have this check less often in a background thread rather than inlined with the other task reporting logic.
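
For example, something along these lines could keep the disk access off the heartbeat path (just a rough sketch of the idea; the class, field, and method names are made up for illustration and nothing here is taken from the patch):

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.fs.FileUtil;

    // Hypothetical monitor: compute scratch dir usage in a background thread so
    // the heartbeat/status path only reads a cached value instead of touching
    // every disk on every report.
    class ScratchDirMonitor {
      private volatile long cachedScratchBytes = 0;
      private final ScheduledExecutorService checker =
          Executors.newSingleThreadScheduledExecutor();

      void start(final List<File> scratchDirs, long intervalSecs) {
        checker.scheduleWithFixedDelay(new Runnable() {
          @Override
          public void run() {
            long total = 0;
            for (File dir : scratchDirs) {
              total += FileUtil.getDU(dir);  // recursive du; disk is touched here only
            }
            cachedScratchBytes = total;
          }
        }, 0, intervalSecs, TimeUnit.SECONDS);
      }

      long currentUsageBytes() {  // cheap read for the reporting path
        return cachedScratchBytes;
      }

      void stop() {
        checker.shutdownNow();
      }
    }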

Comments on the code changes:

shouldFf does not seem appropriate on the generic TaskAttemptEvent.  It's really only a flag that makes sense on one very specific message type, TA_FAILMSG.  It would be cleaner if there was a derived TaskAttemptEvent class for task attempt failure events where this would be an appropriate method.
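
E.g., something roughly like this (just a sketch; the class name is illustrative):

    import org.apache.hadoop.mapreduce.v2.api.records.TaskAttemptId;
    import org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEvent;
    import org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType;

    // Hypothetical derived event that carries the fail-fast flag only on the
    // one message type where it is meaningful.
    public class TaskAttemptFailEvent extends TaskAttemptEvent {
      private final boolean failFast;

      public TaskAttemptFailEvent(TaskAttemptId id, boolean failFast) {
        super(id, TaskAttemptEventType.TA_FAILMSG);
        this.failFast = failFast;
      }

      public boolean isFailFast() {
        return failFast;
      }
    }

That would also tie in with the next point, since the transition could read the flag straight off the event.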

I don't see why failFast needs to be persisted as state in the task.  There's only one transition that needs it, FailedTransition, and the code can convey the event boolean through to notifyTaskAttemptFailed as an argument to that method rather than poking into state and having that method pick it up from there.

Looking up a conf value that isn't changing is wasteful (hashmap lookup followed by string marshalling).  The code can cache the value in an initialize method and use that cached value.
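
For example (illustrative only; the property name here is just the one from the patch):

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical holder just to show the idea: one conf lookup at init time,
    // callers then reuse the cached primitive on every check.
    class ScratchDirLimit {
      private final long limitBytes;

      ScratchDirLimit(Configuration conf) {
        this.limitBytes = conf.getLong("mapreduce.task.local-fs.limit.bytes", -1L);
      }

      long get() {
        return limitBytes;
      }
    }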

FileUtil.getDU should be used instead of FileUtils.sizeOfDirectory so we don't have to add a commons IO dependency to mapreduce-client-core.
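
i.e. something like (sketch only; the wrapper class is just to show the call):

    import java.io.File;
    import org.apache.hadoop.fs.FileUtil;

    // FileUtil.getDU returns the recursive size of a directory in bytes without
    // needing commons-io on the classpath.
    class DirSize {
      static long of(File scratchDir) {
        return FileUtil.getDU(scratchDir);
      }
    }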

TaskAttemptListenerImpl#fatalErrorFF needs an @Override annotation.

Nit: Does it make sense to track the largest size when any size above the limit is going to make it fail?  Likely there's only one at the point where it's going to fail since it's constantly checking, so I'm not seeing the value of tracking the largest here.  Not a must-fix, just curious why it would be necessary.

Nit: The 'FF' acronym could use a more descriptive name since many won't know what 'FF' stands for initially.  Would it make more sense to call it something like TaskFailed (e.g.: fatalErrorTaskFailed) to convey this is a fatal error in an attempt that causes the entire task to fail?

There are a lot of unnecessary whitespace and reformatting changes in TaskAttemptListenerImpl in areas unrelated to what needs to be changed.   That makes the patch larger than it needs to be and more difficult to backport.


> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-7022
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue tasks based on writes to local disk. In our environment, where we mainly run Hive-based jobs, we noticed that this counter and the size of the local scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter was at 300 GB and others where it was at 10 TB, both producing around 200 GB on local disk, so it didn't help us much. So to extend this feature, tasks should monitor local scratch dir size and fail if they pass the limit. In these cases the tasks should not be retried either; instead the job should fast fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org