Posted to common-user@hadoop.apache.org by Stefan Will <st...@gmx.net> on 2009/02/09 20:40:24 UTC

TaskTrackers being double counted after restart job recovery

Hi,

I'm using the new persistent job state feature in 0.19.0, and it's worked
really well so far. However, this morning my JobTracker died with an OOM
error (even though the heap size is set to 768M). So I killed it and all the
TaskTrackers. After starting everything up again, all my nodes were showing
up twice in the JobTracker web interface, with different port numbers. Also,
some of the jobs it restarted had already been completed when the job
tracker died.

Any idea what might be happening here? How can I fix this? Will
temporarily setting mapred.jobtracker.restart.recover=false clear things up?

-- Stefan

Re: TaskTrackers being double counted after restart job recovery

Posted by Owen O'Malley <ow...@gmail.com>.
There is a bug where restarted TaskTrackers get counted twice. The problem
is that a tracker's name is generated from its hostname and port number.
When TaskTrackers restart they come up on a new port and are counted again.
The duplicates go away when the old TaskTrackers time out after 10 minutes,
or when you restart the JobTracker.
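
A minimal sketch of the identity problem (illustrative Java, not Hadoop's
actual classes; the point is only that trackers are keyed by "host:port"):

    import java.util.HashMap;
    import java.util.Map;

    public class TrackerRegistry {
        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        // A tracker's identity includes its port, so a tracker that
        // restarts on a new port registers under a brand-new key...
        public void heartbeat(String host, int port, long now) {
            lastHeartbeat.put(host + ":" + port, now);
        }

        // ...while the old key lingers until it exceeds the expiry interval.
        public void expireStale(long now, long expiryMillis) {
            lastHeartbeat.values().removeIf(t -> now - t > expiryMillis);
        }

        public int liveTrackerCount() {
            return lastHeartbeat.size();
        }
    }

Until expireStale() runs past the expiry mark, liveTrackerCount() reports
both the old and the new incarnation of the same node.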

-- Owen

Re: TaskTrackers being double counted after restart job recovery

Posted by Amar Kamat <am...@yahoo-inc.com>.
Stefan Will wrote:
> Hi,
>
> I'm using the new persistent job state feature in 0.19.0, and it's worked
> really well so far. However, this morning my JobTracker died with an OOM
> error (even though the heap size is set to 768M). So I killed it and all the
> TaskTrackers.
Any specific reason why you killed the task-trackers? Ideally only the
JobTracker should be restarted; the task-trackers will rejoin on their own.
> After starting everything up again, all my nodes were showing
> up twice in the JobTracker web interface, with different port numbers. 
Owen is correct. Since the state is rebuilt from the job history, the old
tracker information is read back from history, hence the double counting.
Killing the tasktrackers after killing the jobtracker is like losing the
tasktrackers while the jobtracker is down. Upon restart the jobtracker
assumes that the tasktrackers mentioned in the history are still available
and waits for them to re-connect. After the expiry interval (default 10
mins), the old trackers will be removed.
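
If 10 minutes is too long to wait, the expiry can be shortened via the
mapred.tasktracker.expiry.interval property (in milliseconds) in
hadoop-site.xml. A sketch, assuming the 0.19 property name; the 2-minute
value is only an example:

    <property>
      <name>mapred.tasktracker.expiry.interval</name>
      <value>120000</value>
      <!-- declare lost trackers after 2 min instead of the 10 min default -->
    </property>
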
> Also,
> some of the jobs it restarted had already been completed when the job
> tracker died.
>   
Old job detection happens via the system directory. When a job is
submitted, its info (job.xml etc.) is copied to the mapred system dir, and
upon completion it is removed from there. Upon restart, the
mapred-system-dir is checked to see which jobs need to be resumed/re-run.
So if the job folder/info is not cleaned up from the system-dir, the job
will be resumed. But if the job was complete, the job logs should mention
that it is complete, and upon restart the jobtracker will simply
finish/complete the job without even running any tasks. Are you seeing
something different here? Look at the jobtracker logs to see what is
happening in the recovery. The line "Restoration complete" marks the end
of recovery.
> Any idea what might be happening here? How can I fix this? Will
> temporarily setting mapred.jobtracker.restart.recover=false clear things up?
>   
You can manually delete a job's files from mapred.system.dir to prevent
that job from being resumed.
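For example, with the jobtracker down (the path and job id here are
illustrative; substitute your configured mapred.system.dir and the job you
want dropped):

    hadoop fs -rmr /mapred/system/job_200902091234_0001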
Amar
> -- Stefan