Posted to mapreduce-issues@hadoop.apache.org by "Xi Fang (JIRA)" <ji...@apache.org> on 2013/09/14 02:06:52 UTC

[jira] [Commented] (MAPREDUCE-5508) Memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767210#comment-13767210 ] 

Xi Fang commented on MAPREDUCE-5508:
------------------------------------

This bug was found in Microsoft's large-scale test with about 200,000 job submissions, during which memory usage grew steadily.

There was a long discussion between Hortonworks (thanks [~cnauroth] and [~vinodkv]) and Microsoft on this issue. Here is a summary of the discussion.

1. The heap dumps show DistributedFileSystem instances that are referenced only from the FileSystem cache's HashMap entries. Since nothing else holds a reference, nothing else can ever close them, and therefore they will never be removed from the cache.
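For context, here is a minimal, standalone sketch of the cache behavior (it uses the default local file system so it runs by itself; the class name is illustrative):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get() stores the instance in the static FileSystem.CACHE,
    // keyed by (scheme, authority, UGI).
    FileSystem fs1 = FileSystem.get(conf);
    FileSystem fs2 = FileSystem.get(conf);
    System.out.println(fs1 == fs2); // true: same cached instance
    // The cache holds a strong reference, so dropping local references does
    // not make the instance collectable; only close() removes the cache
    // entry. Every get() whose entry is never closed is a leaked instance.
    fs1.close();
  }
}
{code}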

2. The special check for "tempDirFS" (see the code in the description) in the patch for MAPREDUCE-5351 is intended as an optimization so that CleanupQueue doesn't need to immediately reopen a FileSystem that was just closed. However, we observed different identity hash code values for the Subject in the cache key. The code assumes that CleanupQueue will find the same Subject that was used inside JobInProgress. Unfortunately, this is not guaranteed, because we may have crossed into a different access control context at this point, via UserGroupInformation#doAs. Even though it is conceptually the same user, the Subject is a function of the current AccessControlContext:
{code}
  public synchronized
  static UserGroupInformation getCurrentUser() throws IOException {
    AccessControlContext context = AccessController.getContext();
    Subject subject = Subject.getSubject(context);
{code}
Even if the contexts are logically equivalent between JobInProgress and CleanupQueue, there is no guarantee that Java will return the same Subject instance, and the same instance is required for a successful lookup in the FileSystem cache (because the cache key uses the Subject's identity hash code).
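The effect is easy to reproduce: two UGIs for the same logical user hold distinct Subject instances, so identity-based equality (and therefore the cache lookup) treats them as different keys. A small sketch (createRemoteUser stands in here for two doAs contexts producing distinct Subjects):
{code}
import org.apache.hadoop.security.UserGroupInformation;

public class SubjectIdentitySketch {
  public static void main(String[] args) throws Exception {
    UserGroupInformation a = UserGroupInformation.createRemoteUser("alice");
    UserGroupInformation b = UserGroupInformation.createRemoteUser("alice");
    // Same user name, but UGI equality compares Subject identity and
    // UGI.hashCode() is System.identityHashCode(subject), so these differ.
    System.out.println(a.equals(b));                  // false
    System.out.println(a.hashCode() == b.hashCode()); // almost certainly false
  }
}
{code}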

The fix is to abandon this optimization and close the FileSystem within the same AccessControlContext that opened it.
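A minimal sketch of that direction (not the committed patch; method and class names are illustrative):
{code}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

class CleanupSketch {
  static void deleteJobTempDir(final Path jobTempDirPath, final Configuration conf,
      UserGroupInformation userUGI) throws Exception {
    userUGI.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        // getFileSystem() and close() run under one AccessControlContext,
        // hence one Subject, hence one FileSystem cache key.
        FileSystem fs = jobTempDirPath.getFileSystem(conf);
        try {
          fs.delete(jobTempDirPath, true);
        } finally {
          fs.close(); // releases the cache entry created above
        }
        return null;
      }
    });
  }
}
{code}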

                
> Memory leak caused by unreleased FileSystem objects in JobInProgress#cleanupJob
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5508
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5508
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1-win
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>            Priority: Critical
>
> MAPREDUCE-5351 fixed a memory leak problem but introduced another FileSystem object that is never properly released.
> {code:title=JobInProgress#cleanupJob()}
>   void cleanupJob() {
> ...
>           // Opens (and caches) a FileSystem instance that is never closed:
>           tempDirFs = jobTempDirPath.getFileSystem(conf);
>           CleanupQueue.getInstance().addToQueue(
>               new PathDeletionContext(jobTempDirPath, conf, userUGI, jobId));
> ...
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira