Posted to common-dev@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2009/05/15 23:47:45 UTC

[jira] Created: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

JobTracker accepts heartbeats before startup is complete
--------------------------------------------------------

                 Key: HADOOP-5852
                 URL: https://issues.apache.org/jira/browse/HADOOP-5852
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Todd Lipcon
            Priority: Critical


When the JobTracker is instantiated, it starts listening on its RPC interfaces before its startup is complete (i.e., before the constructor has finished executing). Because of this, jt.taskScheduler.taskTrackerManager can be null when the JT receives a heartbeat from a TT. This throws the JT/TT pair into a tight infinite loop (HADOOP-5761).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710099#action_12710099 ] 

Devaraj Das commented on HADOOP-5852:
-------------------------------------

This issue has been resolved in 0.20 IIRC, where the RPC handlers are started at the end of all initialization. Look at JobTracker.main - it constructs the JobTracker object and then invokes offerService, and towards the end of offerService, the interTrackerServer is started.
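
The ordering described in this comment (construct everything first, start the RPC server last) can be sketched roughly as follows. This is only an illustration of the pattern, not the actual 0.20 code: TwoPhaseService, StubRpcServer, and their methods are hypothetical names invented for the example.

```java
// Hypothetical stand-in for an RPC server; not a Hadoop class.
class StubRpcServer {
    private boolean started = false;

    void start() { started = true; }

    boolean isStarted() { return started; }
}

// Hypothetical service that separates construction from serving.
class TwoPhaseService {
    private final StubRpcServer server = new StubRpcServer();
    private final String schedulerName;

    TwoPhaseService() {
        // Phase 1: initialize all state. The server is NOT started here,
        // so no remote call can observe a half-built object.
        this.schedulerName = "default";
    }

    void offerService() {
        // Phase 2: every field is set by now, so it is safe to accept calls.
        server.start();
    }

    boolean acceptingCalls() { return server.isStarted(); }

    String heartbeat() {
        if (!server.isStarted()) {
            throw new IllegalStateException("not serving yet");
        }
        return "ok from " + schedulerName;
    }
}

class TwoPhaseDemo {
    public static void main(String[] args) {
        TwoPhaseService svc = new TwoPhaseService(); // fully constructed first
        svc.offerService();                          // then opened for calls
        System.out.println(svc.heartbeat());
    }
}
```

The point of the split is that the caller, not the constructor, decides when the object becomes reachable from other threads.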



[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710558#action_12710558 ] 

Todd Lipcon commented on HADOOP-5852:
-------------------------------------

Thanks, Devaraj and Tom.

I agree that passing "this" out of a constructor is bad Java style, but it's clearly already done since the IPC handlers are instantiated with a reference to the NN object.

Given that this can cause serious issues at startup time, I'd like to target a fix for the 18 branch. Any opinions on that? We could either backport the "offerService" refactor, or simply leak "this" again as proposed in my option #1 above.



[jira] Resolved: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HADOOP-5852.
---------------------------------

    Resolution: Invalid

Sorry guys, false alarm. Turns out this bug does not exist in unpatched branch-18 -- it was introduced by the HADOOP-3746 (fair scheduler) backport patch. I'll comment on that JIRA. Closing as invalid.



[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709986#action_12709986 ] 

Todd Lipcon commented on HADOOP-5852:
-------------------------------------

From the JT log:

{noformat}
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2009-05-15 14:27:30,521 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://localhost/var/lib/hadoop/cache/hadoop/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /var/lib/hadoop/cache/hadoop/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.

...

2009-05-15 14:27:32,202 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/todd-laptop
2009-05-15 14:27:32,204 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8021, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@7461373f, true, true, -1) from 127.0.0.1:36984: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobQueueTaskScheduler.assignTasks(JobQueueTaskScheduler.java:85)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1285)

[infinite loop of above NPEs]
{noformat}

From the TT log:

{noformat}
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: localhost/127.0.0.1:37148
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,195 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,208 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
2009-05-15 14:27:32,220 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
etc etc
{noformat}

These logs are from 0.18.3, but the code seems to indicate this is still an issue in trunk. This happens fairly reliably when I start up all of my Hadoop daemons at the exact same time -- the TT just needs to send its first heartbeat to the JT while the NN is still in safe mode.

The problem lies in the fact that the TaskScheduler's TaskTrackerManager isn't set until after the JobTracker constructor returns. The IPC handlers, however, are started in the middle of the constructor. Therefore, heartbeats can be received when the TaskTrackerManager is null, resulting in the NPE.
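
A stripped-down sketch of this failure mode, using hypothetical stand-in classes (DemoManager, DemoScheduler, DemoTracker are invented for illustration and are not the Hadoop classes), might look like this. The "heartbeat" is called directly rather than over RPC, so the ordering problem is deterministic here instead of a race:

```java
// Hypothetical stand-in for TaskTrackerManager.
interface DemoManager {
    int getTrackerCount();
}

// Hypothetical stand-in for a TaskScheduler.
class DemoScheduler {
    private DemoManager manager; // stays null until setManager() is called

    void setManager(DemoManager manager) {
        this.manager = manager;
    }

    String assignTasks() {
        // Dereferences the manager, so this throws NullPointerException
        // if it runs before setManager().
        return "tasks for " + manager.getTrackerCount() + " tracker(s)";
    }
}

// Hypothetical stand-in for the JobTracker.
class DemoTracker implements DemoManager {
    final DemoScheduler scheduler = new DemoScheduler();

    public int getTrackerCount() { return 1; }

    // Stand-in for the RPC heartbeat handler.
    String heartbeat() {
        try {
            return scheduler.assignTasks();
        } catch (NullPointerException e) {
            return "error: NullPointerException";
        }
    }
}

class EarlyHeartbeatDemo {
    public static void main(String[] args) {
        DemoTracker jt = new DemoTracker();
        System.out.println(jt.heartbeat()); // heartbeat before the manager is wired in
        jt.scheduler.setManager(jt);        // the late setter call
        System.out.println(jt.heartbeat()); // heartbeat after wiring
    }
}
```

In the real system the first call would come from an IPC handler thread that is already running, which is why the window only shows up when the daemons start at nearly the same time.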

Possible solution #1:
{code}
    taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
+   taskScheduler.setTaskTrackerManager(this);
{code}
(and remove that line from startTracker())

Possible solution #2: delay startup of RPC servers until after the JT object is fully initialized and in the RUNNING state, or at least has all of its members initialized.

I like #1 a lot better - it seems odd that this setter is happening at such a late time.



[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710399#action_12710399 ] 

Tom White commented on HADOOP-5852:
-----------------------------------

Some background on why I think setTaskTrackerManager() was being called outside the constructor: it's not generally a good idea to pass "this" from the constructor to another object, since the object hasn't finished initialization and so may not be in a safe state (http://www.ibm.com/developerworks/java/library/j-jtp0618.html#2).
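
To make the hazard concrete, here is a small single-threaded sketch. All class names (DemoRegistry, LeakyComponent, SafeComponent) are hypothetical and chosen only for illustration. The registry calls back immediately on registration, which plays the role of another object using the reference before construction has finished:

```java
interface DemoListener {
    String describe();
}

// Hypothetical registry that uses the listener as soon as it is registered.
class DemoRegistry {
    String lastSeen;

    void register(DemoListener listener) {
        lastSeen = listener.describe();
    }
}

class LeakyComponent implements DemoListener {
    private final String name;

    LeakyComponent(DemoRegistry registry) {
        registry.register(this); // "this" escapes before name is assigned
        this.name = "ready";
    }

    public String describe() { return "name=" + name; }
}

class SafeComponent implements DemoListener {
    private final String name;

    private SafeComponent() {
        this.name = "ready";
    }

    // Factory method: the object is published only after its constructor returns.
    static SafeComponent create(DemoRegistry registry) {
        SafeComponent component = new SafeComponent();
        registry.register(component);
        return component;
    }

    public String describe() { return "name=" + name; }
}

class ThisEscapeDemo {
    public static void main(String[] args) {
        DemoRegistry registry = new DemoRegistry();
        new LeakyComponent(registry);
        System.out.println(registry.lastSeen); // the registry saw a null field
        SafeComponent.create(registry);
        System.out.println(registry.lastSeen); // the registry saw the initialized field
    }
}
```

Even a final field can be observed in its default state this way, which is the kind of unsafe state the linked article warns about.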

HADOOP-3628 introduces an initialize method which will provide a standard place to do initialization.
 
