Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/10/10 15:34:26 UTC

reprocessing hanging tasks

Hi,
I tried to understand the jobtracker code.
Hmm, more than 1000 lines of code in just one class. :-( That makes the
code quite difficult to understand.

Anyway, I'm missing a mechanism to reprocess hanging tasks. Maybe I just
didn't find the code, but I did invest some time looking for it.
As the Google paper describes, the original MapReduce reprocesses tasks
that are still running but are much slower than the other tasks, because
of some hardware failure.
Since I have noticed that the task-tracker isn't that stable yet, I would
really love to have such a reprocessing mechanism.
As far as I have seen, tasks are reprocessed in case the task-tracker crashes
and does not return any reports anymore, or the task-tracker reports a
task failure.
But if, for example, the network speed of a fetching map task is very,
very slow, the job itself takes forever.

I would suggest adding a start time and a finishing time to the task object
and setting these values on status changes.
Based on these values we can calculate the average time a task needs for
processing.
Then we have a configurable minimum fraction of finished tasks before we
start reprocessing tasks; for example, 80% of the tasks need to be done.
Furthermore we have a configurable threshold: in case the processing time
of a task exceeds threshold * average processing time, we just reprocess
the task on another tasktracker.
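
Just to illustrate the heuristic, here is a rough sketch; the class, field
and parameter names are made up and not taken from the current jobtracker
code:

// Hypothetical helper for the jobtracker: decides whether a still-running
// task should be speculatively re-executed on another tasktracker.
public class SpeculativeScheduler {

  private final float minFinishedFraction;  // e.g. 0.8f -> 80% of tasks done
  private final float slownessThreshold;    // e.g. 3.0f -> 3x the average time

  public SpeculativeScheduler(float minFinishedFraction, float slownessThreshold) {
    this.minFinishedFraction = minFinishedFraction;
    this.slownessThreshold = slownessThreshold;
  }

  /**
   * @param finishedDurations running times (ms) of all finished tasks of the job
   * @param totalTasks        total number of tasks in the job
   * @param runningTaskStart  start time (ms) of the still-running task
   * @param now               current time (ms)
   */
  public boolean shouldReexecute(long[] finishedDurations, int totalTasks,
                                 long runningTaskStart, long now) {
    if (totalTasks == 0 || finishedDurations.length == 0) {
      return false;                                 // nothing to compare against yet
    }
    float finishedFraction = (float) finishedDurations.length / totalTasks;
    if (finishedFraction < minFinishedFraction) {
      return false;                                 // wait until e.g. 80% are done
    }
    long sum = 0;
    for (int i = 0; i < finishedDurations.length; i++) {
      sum += finishedDurations[i];
    }
    long average = sum / finishedDurations.length;  // average processing time
    long running = now - runningTaskStart;          // how long this task has been running
    return running > slownessThreshold * average;   // too slow -> run it elsewhere too
  }
}

The jobtracker would then simply hand such a task to a second tasktracker
and take whichever copy finishes first, as the Google paper describes for
backup tasks.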

What do people think?
Did I miss the section in the jobtracker where this is done, or are
people interested in me submitting a patch implementing this mechanism?

Stefan 

Re: reprocessing hanging tasks

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Maybe we misunderstand each other: I do not mean tasks that crash, I mean
> tasks that are 20 times slower on one machine than the same tasks on the
> other machines.

Ah, I call that "speculative re-execution".  Nutch does not yet
implement that.

I don't think speculative re-execution of tasks would help much with 
fetching, since a fetch task that is slow on one machine will probably 
be slow on another.  What would probably make the fetcher faster is to 
use Thread.kill() on fetcher threads which have exceeded a timeout, and 
then replace them with a new Fetcher thread.
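
Something along these lines, just as a sketch: the watchdog and the
fetcher-thread factory below are made up, and I use Thread.interrupt()
here since plain Java has no Thread.kill():

// Hypothetical watchdog that replaces fetcher threads which exceed a timeout.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetcherWatchdog extends Thread {

  private final long timeoutMillis;  // max time a single fetch may take
  private final Map<Thread, Long> lastProgress = new ConcurrentHashMap<Thread, Long>();

  public FetcherWatchdog(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
    setDaemon(true);
  }

  /** Fetcher threads call this after every fetched page. */
  public void reportProgress(Thread t) {
    lastProgress.put(t, System.currentTimeMillis());
  }

  public void run() {
    while (true) {
      long now = System.currentTimeMillis();
      for (Map.Entry<Thread, Long> e : lastProgress.entrySet()) {
        if (now - e.getValue() > timeoutMillis) {
          Thread hung = e.getKey();
          hung.interrupt();                    // only a request; a thread stuck in
                                               // blocking I/O may still ignore it
          lastProgress.remove(hung);
          Thread replacement = newFetcherThread();  // start a fresh fetcher thread
          replacement.start();
          reportProgress(replacement);
        }
      }
      try {
        Thread.sleep(1000);                    // check once per second
      } catch (InterruptedException ie) {
        return;
      }
    }
  }

  // Placeholder: the real fetcher would create another fetch worker here.
  protected Thread newFetcherThread() {
    return new Thread();
  }
}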

Speculative re-execution is among the list of features we'd like to add, 
but it is not a high priority for me.

Doug

Re: reprocessing hanging tasks

Posted by Stefan Groschupf <sg...@media-style.com>.
Doug,
I have definitely run into problems several times where task-trackers were
still sending heartbeat messages but were not making any progress on the
task anymore.
For example, no new pages were fetched, but the pages/sec statistic became
slower and slower.
I personally think it makes more sense if the jobtracker decides whether a
task is over the average processing time and needs to be re-executed or not.
The last section of the Google paper covers this issue, and they notice
performance improvements from re-executing tasks that run over a specific
time.

Maybe we misunderstand each other: I do not mean tasks that crash, I mean
tasks that are 20 times slower on one machine than the same tasks on the
other machines.

Stefan


Am 10.10.2005 um 20:16 schrieb Doug Cutting:

> Stefan Groschupf wrote:
>
>> Did I miss the section in the jobtracker where this is done, or
>> are people interested in me submitting a patch implementing this mechanism?
>>
>
> This is mostly already implemented.  The tasktracker fails tasks  
> that do not update their status within a configurable timeout.   
> Task status is updated each time a task reads an input, writes an  
> output or calls the Reporter.setStatus() method.  The jobtracker  
> will retry failed tasks up to four times.
>
> The mapred-based fetcher also should not hang.  It will exit even  
> when it has hung threads.  So the task timeout should be set to the  
> maximum amount of time that any single page should require to fetch  
> & parse.  By default it is set to 10 minutes.
>
> Doug
>
>

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: reprocessing hanging tasks

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Did I miss the section in the jobtracker where this is done, or are
> people interested in me submitting a patch implementing this mechanism?

This is mostly already implemented.  The tasktracker fails tasks that do 
not update their status within a configurable timeout.  Task status is 
updated each time a task reads an input, writes an output or calls the 
Reporter.setStatus() method.  The jobtracker will retry failed tasks up 
to four times.
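
For illustration, here is a minimal map task that keeps its status fresh.
This is only a sketch: the package names, the Mapper/Reporter signatures
and the JobConf hook below are assumptions, not code copied from the tree.

// Sketch only: package names and exact interface signatures are assumptions.
import java.io.IOException;

import org.apache.nutch.io.Writable;
import org.apache.nutch.io.WritableComparable;
import org.apache.nutch.mapred.JobConf;
import org.apache.nutch.mapred.Mapper;
import org.apache.nutch.mapred.OutputCollector;
import org.apache.nutch.mapred.Reporter;

public class StatusReportingMapper implements Mapper {

  public void configure(JobConf job) {
    // nothing to configure in this example
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {

    // ... do the actual per-record work here ...

    // Telling the tasktracker that we are alive resets its timeout for this
    // task; reading input and collecting output has the same effect.
    reporter.setStatus("processing " + key);

    output.collect(key, value);
  }

  public void close() throws IOException {
    // nothing to clean up in this example
  }
}

A record that takes a long time to process should call setStatus() (or
write some output) at least once within the timeout, otherwise the
tasktracker counts the attempt as failed.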

The mapred-based fetcher also should not hang.  It will exit even when 
it has hung threads.  So the task timeout should be set to the maximum 
amount of time that any single page should require to fetch & parse.  By 
default it is set to 10 minutes.

Doug