Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/07/27 13:12:14 UTC

[jira] Created: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs

Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
----------------------------------------------------------------------------------

                 Key: NUTCH-331
                 URL: http://issues.apache.org/jira/browse/NUTCH-331
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8-dev, 0.9-dev
            Reporter: Andrzej Bialecki 
            Priority: Critical
             Fix For: 0.8-dev, 0.9-dev


Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist. These threads may block for a long time after being started and after reading their input fetchlist entries, due to "politeness" settings. However, the map-reduce framework considers the task as complete when all input data is read.

This causes the tasktracker to incorrectly assume that task processing is complete (the task progress is 1.0, since all input has been consumed), while many URLs from the fetchlist may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent the problem becomes, because the final number of fetched pages may fall short of the target by as many as (numThreads * numMapTasks) entries.

The end result is that only part of the fetchlist is fetched, because Fetcher map tasks are stopped once their progress reaches 1.0.
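
To make the mismatch concrete, here is a minimal sketch of the two ways progress could be measured, plus the worst-case shortfall given above. It is an illustration only and does not use Nutch or Hadoop code; the class name FetchProgressSketch and its methods are hypothetical, and it assumes each thread holds at most one entry that has been read but not yet fetched.

```java
/** Hypothetical illustration of the reported problem; not Nutch or Hadoop code. */
public class FetchProgressSketch {

    /** Progress as described in the issue: fraction of input entries read. */
    public static double inputProgress(int entriesRead, int totalEntries) {
        return totalEntries == 0 ? 1.0 : (double) entriesRead / totalEntries;
    }

    /** Progress based on entries whose fetch has actually finished. */
    public static double completionProgress(int entriesFetched, int totalEntries) {
        return totalEntries == 0 ? 1.0 : (double) entriesFetched / totalEntries;
    }

    /**
     * Worst-case number of unfetched entries across the job, assuming one
     * read-but-unfetched entry per thread in every map task.
     */
    public static long maxShortfall(int numThreads, int numMapTasks) {
        return (long) numThreads * numMapTasks;
    }

    public static void main(String[] args) {
        int totalEntries = 100;
        int numThreads = 10;
        int numMapTasks = 4;

        // All input has been consumed, but every thread still holds one
        // entry and is blocked by the politeness delay.
        int entriesRead = totalEntries;
        int entriesFetched = totalEntries - numThreads;

        System.out.println("input-based progress:      "
                + inputProgress(entriesRead, totalEntries));
        System.out.println("completion-based progress: "
                + completionProgress(entriesFetched, totalEntries));
        System.out.println("worst-case shortfall:      "
                + maxShortfall(numThreads, numMapTasks));
    }
}
```

In this sketch the input-based measure reaches 1.0 as soon as the last entry is read, while the completion-based measure stays below 1.0 until the blocked threads finish. That gap is where the skipped URLs come from.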

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-331?page=all ]

Sami Siren updated NUTCH-331:
-----------------------------

    Fix Version/s:     (was: 0.8)

> Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-331
>                 URL: http://issues.apache.org/jira/browse/NUTCH-331
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.9
>            Reporter: Andrzej Bialecki 
>            Priority: Critical
>             Fix For: 0.9
>
>
> Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist. These threads may block for a long time after being started and after reading their input fetchlist entries, due to "politeness" settings. However, the map-reduce framework considers the task as complete when all input data is read.
> This causes the tasktracker to incorrectly assume that task processing is complete (the task progress is 1.0, since all input has been consumed), while many URLs from the fetchlist may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent the problem becomes, because the final number of fetched pages may fall short of the target by as many as (numThreads * numMapTasks) entries.
> The end result is that only part of the fetchlist is fetched, because Fetcher map tasks are stopped once their progress reaches 1.0.
