Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/04/25 15:22:15 UTC
[jira] Updated: (HADOOP-1252) Disk problems should be handled better by the MR framework

     [ https://issues.apache.org/jira/browse/HADOOP-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-1252:
--------------------------------

    Attachment: 1252.patch

I adopted a simple strategy for this:
Disk selection for a "write" is round-robin, returning the first disk that satisfies the space requirement. When we are given a path to "read", we check each disk in turn to locate the path. New APIs have been added in FileUtil: getLocalPathForWrite and getLocalPathForRead. The write API takes an optional size argument. If the disk that a Map task is currently using runs out of space (e.g., because two tasks are using the same disk), the Map task fails as usual. For Reduce tasks, a particular fetch may fail, but the task itself won't fail unless it cannot find any available disk. For merges, however, the Reduce task fails if the returned disk runs out of space while the data is being spilled to it (as in the Map task case).
I also added APIs in MapOutputFile for creating map/shuffle files, along with equivalent APIs for reading them. Calls to Configuration.getLocalPath have been replaced with the new MapOutputFile APIs wherever possible, and in some cases with the new FileUtil APIs.
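
To make the selection strategy concrete, here is a minimal sketch of round-robin selection for writes and a linear scan for reads. It is not the code in 1252.patch: the class RoundRobinDiskSelector, the Disk stand-in, and the methods pickForWrite and locateForRead are hypothetical names, and free space and file listings are simulated in memory rather than read from real disks.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical, not from the patch. */
class RoundRobinDiskSelector {

    /** In-memory stand-in for one configured local directory. */
    static final class Disk {
        final String path;
        final long freeBytes;                              // simulated free space
        final List<String> files = new ArrayList<>();      // simulated contents

        Disk(String path, long freeBytes) {
            this.path = path;
            this.freeBytes = freeBytes;
        }
    }

    private final List<Disk> disks;
    private int next = 0; // round-robin cursor: index where the next search starts

    RoundRobinDiskSelector(List<Disk> disks) {
        this.disks = disks;
    }

    /**
     * Starting at the cursor, returns the first disk with at least sizeBytes
     * free and moves the cursor just past it. Returns null if no disk fits.
     * No space is reserved, so two callers can still fill the same disk.
     */
    Disk pickForWrite(long sizeBytes) {
        int n = disks.size();
        for (int i = 0; i < n; i++) {
            int idx = (next + i) % n;
            Disk d = disks.get(idx);
            if (d.freeBytes >= sizeBytes) {
                next = (idx + 1) % n;
                return d;
            }
        }
        return null;
    }

    /** Checks each disk in order and returns the first one holding the path, or null. */
    Disk locateForRead(String relativePath) {
        for (Disk d : disks) {
            if (d.files.contains(relativePath)) {
                return d;
            }
        }
        return null;
    }
}
```

Because the sketch does not reserve space at selection time, a write can still fail later if another task fills the chosen disk first, which corresponds to the Map task and merge failure cases described above.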

Comments?

> Disk problems should be handled better by the MR framework
> ----------------------------------------------------------
>
>                 Key: HADOOP-1252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1252
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.12.3
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>             Fix For: 0.13.0
>
>         Attachments: 1252.patch
>
>
> The MR framework should recover from disk failures without causing jobs to hang. Note that this issue is about a short-term solution to the problem, for example, looking at the code and improving the exception handling (to better detect faulty disks and missing files). The long-term approach might be to have an FS layer that takes care of failed disks and makes this transparent to the tasks. That will be a separate issue.
> Some of the issues that have been reported are HADOOP-1087 and a comment by Koji on HADOOP-1200 (not sure whether those are all). Please add to this issue as many details as possible on disk failures leading to hung jobs (relevant exception traces, ways to reproduce, etc.).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.