Posted to mapreduce-issues@hadoop.apache.org by "Ivan Mitic (JIRA)" <ji...@apache.org> on 2013/10/01 00:00:24 UTC

[jira] [Updated] (MAPREDUCE-5512) TaskTracker hung after failed reconnect to the JobTracker

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated MAPREDUCE-5512:
----------------------------------

    Attachment: MAPREDUCE-5512.branch-1.patch

Attaching the patch.

My proposed fix is to make the distributed cache cleanup thread a daemon thread. Based on a scan through the code, I think this change should be safe.
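
To illustrate the daemon approach in general terms, here is a minimal Java sketch. It is not the attached patch: the class DaemonCleanupSketch, the startCleanupThread signature, and the thread name "cleanup-sketch" are hypothetical stand-ins, not Hadoop code. It only shows the relevant JVM behavior, namely that a thread marked as a daemon before start() does not keep the process alive on its own.

```java
/** Hypothetical stand-in for a periodic cleanup worker; not the Hadoop class. */
final class DaemonCleanupSketch {

    /**
     * Starts a loop that runs cleanupPass every intervalMillis on a daemon
     * thread. A daemon thread does not keep the JVM alive once all
     * non-daemon threads have exited.
     */
    static Thread startCleanupThread(final Runnable cleanupPass,
                                     final long intervalMillis) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        cleanupPass.run();
                        Thread.sleep(intervalMillis);
                    }
                } catch (InterruptedException e) {
                    // Interrupted while sleeping: treat it as a stop request.
                }
            }
        }, "cleanup-sketch");
        t.setDaemon(true); // must be set before start()
        t.start();
        return t;
    }
}
```

With this shape, a later initialization step that throws would no longer leave behind a thread that holds the process open, even if nothing stops the cleanup loop explicitly.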

For the unit test, I added a test that validates the list of non-daemon threads. This is a more general test case, but I think it will serve well to protect the codebase against regressions in this area. I could not come up with a clean way to simulate the condition from this bug without adding a test hook to the production code, so I moved away from that approach. Reproducing it would require starting the JT, stopping it, and starting it again (which tells the TT to reinit), and then stopping the JT once more, with that last stop timed to run before TT#initialize() executes.
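
A check of this kind could look roughly like the following Java sketch. It is an assumption about the general technique, not the test in the attached patch: NonDaemonThreadCheck and unexpectedNonDaemonThreads are hypothetical names, and the whitelist is passed in as a plain set of thread names. It relies only on the standard Thread.getAllStackTraces(), Thread.isDaemon(), and Thread.isAlive() calls.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; the names here are illustrative, not from the patch. */
final class NonDaemonThreadCheck {

    /**
     * Returns the names of live non-daemon threads that are not whitelisted.
     * A test could assert that this set is empty once the component under
     * test has been started.
     */
    static Set<String> unexpectedNonDaemonThreads(Set<String> allowedNames) {
        Set<String> unexpected = new TreeSet<String>();
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()
                    && !allowedNames.contains(t.getName())) {
                unexpected.add(t.getName());
            }
        }
        return unexpected;
    }
}
```

Matching on exact thread names is the simplest choice for a sketch. A real test would probably need prefix or pattern matching, since many thread names carry counters or ports.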

Slightly orthogonally, looking at the list of threads I had to whitelist, there may be other threads that are candidates for becoming daemons, but I'd prefer not to make that change in the context of this Jira.

> TaskTracker hung after failed reconnect to the JobTracker
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5512
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5512
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: hadoop-tasktracker-RD00155DD09100.log, MAPREDUCE-5512.branch-1.patch, tt_Hung.txt
>
>
> TaskTracker hung after failed reconnect to the JobTracker. 
> This is the problematic piece of code:
> {code}
>     this.distributedCacheManager = new TrackerDistributedCacheManager(
>         this.fConf, taskController);
>     this.distributedCacheManager.startCleanupThread();
>     
>     this.jobClient = (InterTrackerProtocol) 
>     UserGroupInformation.getLoginUser().doAs(
>         new PrivilegedExceptionAction<Object>() {
>       public Object run() throws IOException {
>         return RPC.waitForProxy(InterTrackerProtocol.class,
>             InterTrackerProtocol.versionID,
>             jobTrackAddr, fConf);
>       }
>     });
> {code}
> If RPC.waitForProxy() throws, the TrackerDistributedCacheManager cleanup thread is never stopped, and because it is a non-daemon thread, it keeps the TT up forever.



--
This message was sent by Atlassian JIRA
(v6.1#6144)