Posted to mapreduce-issues@hadoop.apache.org by "Bhallamudi Venkata Siva Kamesh (JIRA)" <ji...@apache.org> on 2011/02/17 16:19:24 UTC

[jira] Commented: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995845#comment-12995845 ] 

Bhallamudi Venkata Siva Kamesh commented on MAPREDUCE-1213:
-----------------------------------------------------------

While analyzing the patch, I found an issue. The moveAndDelete method below is called from both the JobTracker and the TaskTracker: the JobTracker calls it on its JobTracker folder, and the TaskTracker on its TaskTracker folder (e.g. /home/hadoop/tasktracker/local). The method renames the current folder and then deletes the renamed folder asynchronously. If the deletion step fails for some reason (for example, the process is killed abruptly), the renamed folders are never deleted by anyone.




{code:title=MRAsyncDiskService.java|borderStyle=solid}

  public boolean moveAndDelete(String volume, String pathName) throws IOException {
    // Move the file right now, so that it can be deleted later
    String newPathName;
    synchronized (this) {
      newPathName = format.format(new Date()) + "_" + uniqueId;
      uniqueId ++;
    }
    newPathName = SUBDIR + Path.SEPARATOR_CHAR + newPathName;

    Path source = new Path(volume, pathName);
    Path target = new Path(volume, newPathName);
    try {
      if (!localFileSystem.rename(source, target)) {
        return false;
      }
    } catch (FileNotFoundException e) {
      // Return false in case that the file is not found.
      return false;
    }
    DeleteTask task = new DeleteTask(volume, pathName, newPathName);
    execute(volume, task);
    return true;
  }
{code}
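
One possible way to handle the leftovers, sketched here only for discussion: at startup, scan the rename target directory on each volume and schedule whatever is found there for deletion again. This is not taken from MRAsyncDiskService or from any of the attached patches. The class StaleDirCleaner, its method names and the directory name "toBeDeleted" (standing in for SUBDIR above) are assumptions made for illustration, and the sketch uses plain java.nio rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.stream.Stream;

/** Hypothetical helper; not part of MRAsyncDiskService or the attached patches. */
final class StaleDirCleaner {

  /** Assumed name of the directory that moveAndDelete renames into (SUBDIR in the snippet). */
  static final String SUBDIR = "toBeDeleted";

  /** Lists entries left under volume/SUBDIR by an earlier, interrupted run. */
  static List<Path> findLeftovers(Path volume) throws IOException {
    Path subdir = volume.resolve(SUBDIR);
    List<Path> leftovers = new ArrayList<>();
    if (!Files.isDirectory(subdir)) {
      return leftovers; // nothing was ever renamed on this volume
    }
    try (Stream<Path> entries = Files.list(subdir)) {
      entries.forEach(leftovers::add);
    }
    return leftovers;
  }

  /** Recursively deletes one path, children before parents. */
  static void deleteRecursively(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException e)
          throws IOException {
        if (e != null) {
          throw e;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }

  /** Schedules every leftover for background deletion; meant to run once at startup. */
  static List<Future<?>> cleanupAtStartup(Path volume, ExecutorService pool)
      throws IOException {
    List<Future<?>> pending = new ArrayList<>();
    for (Path leftover : findLeftovers(volume)) {
      Callable<Void> task = () -> {
        deleteRecursively(leftover);
        return null;
      };
      pending.add(pool.submit(task));
    }
    return pending;
  }
}
```

Because the scan only schedules work and returns, startup would stay fast even when many renamed folders have accumulated. SUBDIR itself is left in place so that later renames still have a target.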

> TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1213
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: dhruba borthakur
>            Assignee: Zheng Shao
>             Fix For: 0.21.0
>
>         Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch, MAPREDUCE-1213.branch-0.20.2.patch, MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a tasktracker, it tries to recursively delete all the files in the distributed cache. It invokes FileUtil.fullyDelete(), which is very slow. This means that the TaskTracker cannot join the cluster for an extended period of time (up to 2 hours for us). The problem is acute if the number of files in the distributed cache is a few thousand.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira