Posted to common-user@hadoop.apache.org by stack <st...@archive.org> on 2006/03/03 18:10:40 UTC

Pulsing TaskTrackers

Over the course of a job, jobtracker.jsp shows the list of TaskTrackers shrinking, in a HADOOP-16-like manner.  No errors show in jobdetails.jsp to explain the disappearances.  I see this behavior in a fresh svn pull from yesterday morning.  I also retried after Doug committed a HADOOP-16 patch yesterday afternoon.  Any pointers appreciated.

Here's a description of what I'm seeing.

I submit my job.  Input is a file of about 25k lines.  All seems to come up fine with all slaves present and correct.  Watching jobtracker.jsp, I see ever-increasing 'Secs since heartbeat' for some trackers.  After a short while I start seeing the following in the jobtracker log:

...
060302 202434 Task 'task_m_112z0h' has been lost.
060302 202434 Task 'task_r_5kc1wr' has been lost.
060302 202434 Lost tracker 'tracker_14653'
060302 202434 Task 'task_m_40pz2i' has been lost.
060302 202434 Task 'task_r_10mvk5' has been lost.
...

...and tasktrackers then drop off the jobtracker screen.

As time goes by, fewer and fewer tasktrackers show in jobtracker.jsp... Usually the job eventually fails.

ipc timeout is an hour.

Looking out on the slaves, they seem to be humming along merrily -- until what looks like the tasktracker killing its running child.

I see this behaviour on two different deploys. Will keep digging, but if you have suggestions for what to try, send them on over.

Thanks,
St.Ack

Re: Pulsing TaskTrackers

Posted by Michael Stack <st...@archive.org>.

Doug Cutting wrote:
> stack wrote:
>> ipc timeout is an hour.
>
> FYI, you should no longer need to set this so high.  I've left it at 
> the default for recent runs.  But I doubt that is what is causing your 
> difficulties...
>
I just did the same.  A high IPC timeout is not a good idea, as a hung 
connection will tend to obscure any actual issues and prevent timely 
rescheduling of tasks.
Thanks,
St.Ack
 
> Doug

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by Doug Cutting <cu...@apache.org>.

stack wrote:
> Is there a configurable timeout that says how long jobtrackers wait on 
> communique from tasktrackers?

It is currently a constant, TASKTRACKER_EXPIRY_INTERVAL in 
MRConstants.java, set to 60 seconds.

Doug
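
For illustration, the expiry check described above can be sketched roughly as follows.  This is a minimal sketch, not the actual JobTracker code: the class TrackerExpirySketch and its methods are hypothetical, and only the 60-second value comes from the message above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names here are assumptions, not Hadoop's classes.
class TrackerExpirySketch {
    // Mirrors the 60-second value quoted for TASKTRACKER_EXPIRY_INTERVAL.
    static final long EXPIRY_INTERVAL_MS = 60 * 1000L;

    // Last heartbeat time (ms) per tracker name, in insertion order.
    private final Map<String, Long> lastHeartbeat = new LinkedHashMap<>();

    // Record that a tracker reported in at the given time.
    void heartbeat(String tracker, long nowMs) {
        lastHeartbeat.put(tracker, nowMs);
    }

    // Remove and return every tracker silent for longer than the interval.
    List<String> expire(long nowMs) {
        List<String> lost = new ArrayList<>();
        lastHeartbeat.entrySet().removeIf(e -> {
            boolean expired = nowMs - e.getValue() > EXPIRY_INTERVAL_MS;
            if (expired) {
                lost.add(e.getKey());
            }
            return expired;
        });
        return lost;
    }

    public static void main(String[] args) {
        TrackerExpirySketch jt = new TrackerExpirySketch();
        jt.heartbeat("tracker_a", 0L);
        jt.heartbeat("tracker_b", 50_000L);
        // At t=70s tracker_a has been silent for 70s, tracker_b for only 20s.
        System.out.println(jt.expire(70_000L)); // prints [tracker_a]
    }
}
```

With a fixed interval like this, any tracker whose main loop stalls for more than a minute is dropped even if its tasks are healthy, which matches the symptoms reported in this thread.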

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by Doug Cutting <cu...@apache.org>.

stack wrote:
>> I think I see what the problem is.  The job jar is copied out of dfs 
>> to the local filesystem in the top-level loop of the tasktracker, not 
>> in the TaskRunner, which runs as a separate thread.  This can cause 
>> tasktrackers to time out.  So we should move that part of 
>> localizeTask() into TaskRunner.run() to avoid this.
>>
> Perhaps.  My job jar is large.  It's nutch and then some.

I profiled things and found another culprit, now patched.  We should 
still move this copying out of the main tasktracker loop, but I think 
the reduce tasks were choking the jobtracker while polling for map 
output.  Please tell me if you still see problems.

Doug

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by Doug Cutting <cu...@apache.org>.

stack wrote:
> Upping the hardcoded timeout from 60 seconds to ten minutes also helped.  
> I see some tasks in jobtracker.jsp with times-since-last-communication 
> north of 60 seconds that subsequently recover.  Perhaps I should add a 
> patch that makes this configurable?

I think once we move the jar copying into a separate thread then this 
will no longer be an issue.  So don't bother with a patch just yet.

Doug

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by stack <st...@archive.org>.

Doug Cutting wrote:
> stack wrote:
>> Is there a configurable timeout that says how long jobtrackers wait 
>> on communique from tasktrackers?
>
It looks like it's my job that's the problem.  I've moved to new hardware 
and OS, and the /tmp dir there is smaller; it was over-filling with temporary 
files as the job ran, failing silently.  Now the tasktrackers stick around.

Upping the hardcoded timeout from 60 seconds to ten minutes also helped.  
I see some tasks in jobtracker.jsp with times-since-last-communication 
north of 60 seconds that subsequently recover.  Perhaps I should add a 
patch that makes this configurable?

> I think I see what the problem is.  The job jar is copied out of dfs 
> to the local filesystem in the top-level loop of the tasktracker, not 
> in the TaskRunner, which runs as a separate thread.  This can cause 
> tasktrackers to time out.  So we should move that part of 
> localizeTask() into TaskRunner.run() to avoid this.
>
Perhaps.  My job jar is large.  It's nutch and then some.

Thanks,
St.Ack


> Also, it is rather confusing that there are two classes named 
> TaskInProgress, one nested in TaskTracker and one used by the 
> JobTracker...
>
> Doug
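
The idea floated above of making the expiry timeout configurable could look roughly like the sketch below.  It is an assumption-laden illustration only: the property name "example.tracker.expiry.ms" and the class ExpiryConfigSketch are hypothetical, and java.util.Properties stands in for whatever configuration object the real code would use.  The 60-second default is the value mentioned in this thread.

```java
import java.util.Properties;

// Hypothetical sketch; not a patch against Hadoop.
class ExpiryConfigSketch {
    // Hypothetical property name, chosen only for this example.
    static final String KEY = "example.tracker.expiry.ms";

    // Default matches the hardcoded 60 seconds discussed in the thread.
    static final long DEFAULT_MS = 60_000L;

    // Returns the configured interval, or the default when the
    // property is missing, unparsable, or not positive.
    static long expiryIntervalMs(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_MS;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value > 0 ? value : DEFAULT_MS;
        } catch (NumberFormatException e) {
            return DEFAULT_MS;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "600000"); // ten minutes, as tried above
        System.out.println(expiryIntervalMs(conf));
    }
}
```

Falling back to the default on bad input keeps a typo in a config file from silently disabling tracker expiry.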

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by Doug Cutting <cu...@apache.org>.

stack wrote:
> Is there a configurable timeout that says how long jobtrackers wait on 
> communique from tasktrackers?

I think I see what the problem is.  The job jar is copied out of dfs to 
the local filesystem in the top-level loop of the tasktracker, not in 
the TaskRunner, which runs as a separate thread.  This can cause 
tasktrackers to time out.  So we should move that part of localizeTask() 
into TaskRunner.run() to avoid this.

Also, it is rather confusing that there are two classes named 
TaskInProgress, one nested in TaskTracker and one used by the JobTracker...

Doug
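
The restructuring proposed above, doing the slow jar copy inside the per-task thread so the tracker's main loop can keep heartbeating, can be sketched as follows.  This is a simplified illustration under stated assumptions: TaskRunnerSketch and TrackerLoopSketch are hypothetical stand-ins, not the real TaskRunner or TaskTracker, and a CountDownLatch simulates a copy that takes longer than several heartbeats.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a per-task thread: localize first, then run.
class TaskRunnerSketch extends Thread {
    private final Runnable localize;
    private final Runnable task;

    TaskRunnerSketch(Runnable localize, Runnable task) {
        this.localize = localize;
        this.task = task;
    }

    @Override
    public void run() {
        localize.run(); // slow copy happens here, off the main loop
        task.run();
    }
}

// Hypothetical stand-in for the tracker's top-level loop.
class TrackerLoopSketch {
    int heartbeatsSent = 0;

    // Starts a task whose "jar copy" blocks until released, sends `beats`
    // heartbeats meanwhile, and returns how many had been sent by the
    // time the task body finally ran.
    int launchAndHeartbeat(int beats) {
        CountDownLatch copyMayFinish = new CountDownLatch(1);
        int[] beatsSeenByTask = new int[1];

        Thread runner = new TaskRunnerSketch(
            () -> {
                try {
                    copyMayFinish.await(); // simulated slow copy
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            },
            () -> beatsSeenByTask[0] = heartbeatsSent);
        runner.start();

        // The main loop is not blocked by the copy and keeps heartbeating.
        for (int i = 0; i < beats; i++) {
            heartbeatsSent++;
        }

        copyMayFinish.countDown();
        try {
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return beatsSeenByTask[0];
    }

    public static void main(String[] args) {
        System.out.println(new TrackerLoopSketch().launchAndHeartbeat(5));
    }
}
```

If the copy ran inline in the loop instead, no heartbeat could be sent until it finished, so a large job jar could push a tracker past the expiry interval.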

Re: Disappearing TaskTrackers (Was: Pulsing TaskTrackers)

Posted by stack <st...@archive.org>.

Doug Cutting wrote:
> stack wrote:
>> ipc timeout is an hour.
>
> FYI, you should no longer need to set this so high.  I've left it at 
> the default for recent runs.  But I doubt that is what is causing your 
> difficulties...
>
> Doug
Is there a configurable timeout that says how long jobtrackers wait on 
communique from tasktrackers?
Thanks,
St.Ack

Re: Pulsing TaskTrackers

Posted by Doug Cutting <cu...@apache.org>.

stack wrote:
> ipc timeout is an hour.

FYI, you should no longer need to set this so high.  I've left it at the 
default for recent runs.  But I doubt that is what is causing your 
difficulties...

Doug

Re: Pulsing TaskTrackers

Posted by stack <st...@archive.org>.

Doug Cutting wrote:
> I just fixed a few more tasktracker/jobtracker bugs.  I'm currently 
> running a test crawl on 20 machines & everything looks good.  Please 
> give it a try with the latest & tell me whether things improve.

I updated (I'm at revision 382966) and restarted.  Seems to lose tasks 
at a faster rate now.  35 slaves.  4 tasks per node.  Will try with fewer 
tasks per host.
St.Ack

>
> Doug
>

Re: Pulsing TaskTrackers

Posted by Doug Cutting <cu...@apache.org>.

I just fixed a few more tasktracker/jobtracker bugs.  I'm currently 
running a test crawl on 20 machines & everything looks good.  Please 
give it a try with the latest & tell me whether things improve.

Doug
