Posted to common-dev@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2009/05/15 23:55:45 UTC

[jira] Commented: (HADOOP-5852) JobTracker accepts heartbeats before startup is complete

    [ https://issues.apache.org/jira/browse/HADOOP-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709986#action_12709986 ] 

Todd Lipcon commented on HADOOP-5852:
-------------------------------------

From the JT log:

{noformat}
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021
2009-05-15 14:27:28,775 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2009-05-15 14:27:30,521 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://localhost/var/lib/hadoop/cache/hadoop/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /var/lib/hadoop/cache/hadoop/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.

...

2009-05-15 14:27:32,202 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/todd-laptop
2009-05-15 14:27:32,204 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8021, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@7461373f, true, true, -1) from 127.0.0.1:36984: error: java.io.IOException: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobQueueTaskScheduler.assignTasks(JobQueueTaskScheduler.java:85)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1285)

[infinite loop of above NPEs]
{noformat}

From the TT log:

{noformat}
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: localhost/127.0.0.1:37148
2009-05-15 14:27:32,124 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,195 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_todd-laptop:localhost/127.0.0.1:37148
2009-05-15 14:27:32,208 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
2009-05-15 14:27:32,220 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'localhost' with reponseId '-1
etc etc
{noformat}

These logs are from 0.18.3, but the code suggests this is still an issue in trunk. It happens fairly reliably when I start all of my Hadoop daemons at the same time: the TT just needs to send its first heartbeat to the JT while the NN is still in safe mode.

The problem is that the TaskScheduler's TaskTrackerManager isn't set until after the JobTracker constructor returns, whereas the IPC handlers are started in the middle of the constructor. Heartbeats can therefore arrive while the TaskTrackerManager is still null, resulting in the NPE.
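
To make the ordering concrete, here is a minimal Java sketch of the race. It uses simplified stand-ins (Manager, Scheduler, Tracker and finishStartup() are hypothetical names, not the real Hadoop classes), and it calls heartbeat() directly instead of going through IPC, so it only illustrates the null dereference and not the actual code path:

```java
// Simplified model of the startup race; these are NOT the real Hadoop classes.
final class HeartbeatRaceSketch {

    /** Stand-in for TaskTrackerManager. */
    interface Manager {
        int trackerCount();
    }

    /** Stand-in for a TaskScheduler whose manager is injected by a setter. */
    static final class Scheduler {
        private Manager manager; // stays null until setManager() is called

        void setManager(Manager manager) {
            this.manager = manager;
        }

        int assignTasks() {
            // Dereferences the manager, which fails if it was never set.
            return manager.trackerCount();
        }
    }

    /** Stand-in for the JobTracker. */
    static final class Tracker implements Manager {
        final Scheduler scheduler = new Scheduler();

        Tracker() {
            // In the real code the IPC handlers are started inside the
            // constructor, so heartbeat() is reachable from this point on.
        }

        /** Plays the role of startTracker(): the late wiring step. */
        void finishStartup() {
            scheduler.setManager(this);
        }

        @Override
        public int trackerCount() {
            return 1;
        }

        /** Returns "ok", or the simple name of the exception that was raised. */
        String heartbeat() {
            try {
                scheduler.assignTasks();
                return "ok";
            } catch (RuntimeException e) {
                return e.getClass().getSimpleName();
            }
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        System.out.println("before wiring: " + tracker.heartbeat());
        tracker.finishStartup();
        System.out.println("after wiring:  " + tracker.heartbeat());
    }
}
```

In this model a heartbeat that arrives between construction and finishStartup() hits the unset manager, which corresponds to the window described above.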

Possible solution #1:
{code}
    taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);
+   taskScheduler.setTaskTrackerManager(this);
{code}
(and remove that line from startTracker())

Possible solution #2: delay startup of the RPC servers until after the JT object is fully initialized and in the RUNNING state, or at least has all of its members initialized.

I like #1 a lot better; it seems odd that this setter is called so late.
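
For comparison, option #2 might look roughly like the following sketch. All names here are hypothetical (this is not the real JobTracker API), and the "RPC server" is reduced to a volatile flag that gates heartbeat(); the only point is the ordering, with every member wired before the tracker becomes reachable:

```java
// Sketch of option #2 with hypothetical names; not the real JobTracker API.
final class DelayedRpcStartSketch {

    /** Stand-in for TaskTrackerManager. */
    interface Manager {
        int trackerCount();
    }

    /** Stand-in for a TaskScheduler whose manager is injected by a setter. */
    static final class Scheduler {
        private Manager manager;

        void setManager(Manager manager) {
            this.manager = manager;
        }

        int assignTasks() {
            return manager.trackerCount();
        }
    }

    /** Stand-in for the JobTracker. */
    static final class Tracker implements Manager {
        private final Scheduler scheduler = new Scheduler();

        // Models "the RPC server is accepting calls". The volatile write in
        // start() happens after the wiring, so a thread that observes
        // serving == true also observes the scheduler's manager.
        private volatile boolean serving = false;

        Tracker() {
            // Nothing is exposed to remote callers here.
        }

        /** Finish initialization first, open the "server" last. */
        void start() {
            scheduler.setManager(this);
            serving = true;
        }

        /** Convenience factory: construct, wire, then start serving. */
        static Tracker startTracker() {
            Tracker tracker = new Tracker();
            tracker.start();
            return tracker;
        }

        @Override
        public int trackerCount() {
            return 1;
        }

        int heartbeat() {
            if (!serving) {
                // A real, unstarted RPC server would refuse the connection instead.
                throw new IllegalStateException("RPC server not started yet");
            }
            return scheduler.assignTasks();
        }
    }
}
```

Compared with option #1 this touches more of the startup sequence, since the server start has to move out of the constructor.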

> JobTracker accepts heartbeats before startup is complete
> --------------------------------------------------------
>
>                 Key: HADOOP-5852
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5852
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> When the JobTracker is instantiated, it starts listening on its RPC interfaces before its startup is complete (i.e. before the constructor has finished executing). Because of this, jt.taskScheduler.taskTrackerManager can be null when the JT receives a heartbeat from a TT. This throws the JT/TT pair into a tight infinite loop (HADOOP-5761).
