Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/11/23 11:58:03 UTC

[jira] Closed: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs

     [ http://issues.apache.org/jira/browse/NUTCH-331?page=all ]

Andrzej Bialecki  closed NUTCH-331.
-----------------------------------

    Resolution: Cannot Reproduce
      Assignee: Andrzej Bialecki 

I can no longer reproduce this issue, and I suspect it was caused by NUTCH-361.

> Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-331
>                 URL: http://issues.apache.org/jira/browse/NUTCH-331
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>            Priority: Critical
>             Fix For: 0.9.0
>
>
> Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist. Because of "politeness" settings, these threads may block for a long time after being started and after reading their input fetchlist entries. However, the map-reduce framework considers the task complete once all input data has been read.
> This causes the tasktracker to incorrectly assume that task processing is complete (because the task progress is 1.0, since all input has been consumed), whereas many URLs from the fetchlist may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent this problem becomes, because the final number of fetched pages may fall short of the target number by as many as (numThreads * numMapTasks) entries.
> The end result is that only part of the fetchlist is fetched, because Fetcher map tasks are stopped when their progress reaches 1.0.
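
The arithmetic in the description above can be illustrated with a rough sketch. This is not Nutch or Hadoop code: the function names and the numbers are hypothetical, chosen only to show how progress measured by input consumed can reach 1.0 while each thread still holds an unfetched entry.

```python
def input_based_progress(records_read, total_records):
    """Progress as the fraction of input records consumed (hypothetical helper)."""
    return records_read / total_records if total_records else 1.0


def completion_based_progress(records_fetched, total_records):
    """Progress as the fraction of records actually fetched (hypothetical helper)."""
    return records_fetched / total_records if total_records else 1.0


def max_shortfall(num_threads, num_map_tasks):
    """Upper bound from the description: one pending entry per thread per map task."""
    return num_threads * num_map_tasks


# Assumed example values, not taken from the issue.
total_records = 1000   # fetchlist entries given to one map task
num_threads = 10       # FetcherThreads per map task
num_map_tasks = 4      # map tasks in the job

# Every entry has been read from the input, but each thread is still
# blocked (e.g. by politeness delays) holding one entry it has not fetched.
records_read = total_records
records_fetched = total_records - num_threads

reported = input_based_progress(records_read, total_records)
actual = completion_based_progress(records_fetched, total_records)
print("reported progress:", reported)
print("actual progress:", actual)
print("worst-case missing entries across the job:",
      max_shortfall(num_threads, num_map_tasks))
```

With these assumed values the reported progress is already 1.0 while 10 entries per task are still pending, which is the gap the description attributes to stopping tasks on input-based progress.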

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira