Posted to dev@nutch.apache.org by "Otis Gospodnetic (JIRA)" <ji...@apache.org> on 2008/04/12 09:07:04 UTC

[jira] Created: (NUTCH-629) Detect slow and timeout servers and drop their URLs

Detect slow and timeout servers and drop their URLs
---------------------------------------------------

                 Key: NUTCH-629
                 URL: https://issues.apache.org/jira/browse/NUTCH-629
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Otis Gospodnetic


Fetch jobs will finish faster if we find a way to prevent servers that are either slow or time out from slowing down the whole process.

I'll attach a patch that counts per-server exceptions and timeouts and tracks download speed per server.
Queues/servers that exceed timeout or download thresholds are marked as "tooManyErrors" or "tooSlow".  Once they get marked as such, all of their subsequent URLs get dropped (i.e. they do not get fetched) and marked GONE.

At the end of the fetch task, stats for each server processed are printed.
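
A minimal sketch of the bookkeeping described above, not the code from the attached patch. The class name HostHealthTracker, its method names, the threshold parameters, and the rule that a host is only judged "too slow" after a minimum number of successful fetches are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-host bookkeeping; names and thresholds are illustrative only. */
final class HostHealthTracker {
    enum Status { OK, TOO_MANY_ERRORS, TOO_SLOW }

    private static final class Stats {
        int failures;   // exceptions and timeouts, counted together
        int fetches;    // successful downloads
        long bytes;     // total bytes downloaded
        long millis;    // total time spent downloading
        Status status = Status.OK;

        double bytesPerSecond() {
            // No measurable time yet: treat the host as "not slow".
            return millis <= 0 ? Double.POSITIVE_INFINITY : bytes * 1000.0 / millis;
        }
    }

    private final Map<String, Stats> byHost = new TreeMap<>();
    private final int maxFailures;
    private final double minBytesPerSecond;
    private final int minFetchesBeforeSpeedCheck;

    HostHealthTracker(int maxFailures, double minBytesPerSecond, int minFetchesBeforeSpeedCheck) {
        this.maxFailures = maxFailures;
        this.minBytesPerSecond = minBytesPerSecond;
        this.minFetchesBeforeSpeedCheck = minFetchesBeforeSpeedCheck;
    }

    private Stats stats(String host) {
        return byHost.computeIfAbsent(host, h -> new Stats());
    }

    /** Record an exception or timeout; mark the host once the limit is exceeded. */
    void recordFailure(String host) {
        Stats s = stats(host);
        s.failures++;
        if (s.status == Status.OK && s.failures > maxFailures) {
            s.status = Status.TOO_MANY_ERRORS;
        }
    }

    /** Record a successful download; mark the host if its average speed is too low. */
    void recordSuccess(String host, long bytes, long millis) {
        Stats s = stats(host);
        s.fetches++;
        s.bytes += bytes;
        s.millis += millis;
        if (s.status == Status.OK
                && s.fetches >= minFetchesBeforeSpeedCheck
                && s.bytesPerSecond() < minBytesPerSecond) {
            s.status = Status.TOO_SLOW;
        }
    }

    /** True once a host has been marked; the caller would then skip its URLs. */
    boolean shouldDrop(String host) {
        Stats s = byHost.get(host);
        return s != null && s.status != Status.OK;
    }

    Status status(String host) {
        Stats s = byHost.get(host);
        return s == null ? Status.OK : s.status;
    }

    /** One line per host, meant to be printed at the end of the fetch task. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Stats> e : byHost.entrySet()) {
            Stats s = e.getValue();
            sb.append(String.format("%s status=%s failures=%d fetches=%d bytes/s=%.1f%n",
                    e.getKey(), s.status, s.failures, s.fetches, s.bytesPerSecond()));
        }
        return sb.toString();
    }
}
```

In this sketch a fetcher thread would call shouldDrop(host) before each request and, when it returns true, skip the URL and mark it GONE instead of fetching it. A real fetcher is multi-threaded, so the map and counters would also need synchronization, which is left out here.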

Also, I believe the per-host/domain/TLD/etc. DB from NUTCH-628 would be the right place to add server data collected by this patch.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-629) Detect slow and timeout servers and drop their URLs

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588746#action_12588746 ] 

Otis Gospodnetic commented on NUTCH-629:
----------------------------------------

While the patch improves fetch speed when there are lots of timeouts, the "slow but not slow enough" servers are still a problem.  By that, I mean servers whose download speed is over the threshold, so they do not get dropped from the fetch job, but which are still slow enough that, if they have a lot of URLs in the fetchlist, they take much longer to fetch from than the other hosts and thus drag the fetch run out.

I *think* this download speed information has to be stored in the host DB (NUTCH-628).  Generator could then use this information when generating the fetchlist.  For hosts that are slower, it would generate fewer URLs, and for hosts that are faster, it could generate more URLs.  In the ideal scenario, I think, this would result in URLs from all hosts getting fetched around the same time.

Does this make sense?  Is my thinking OK or is it flawed?
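
To make the generator idea above concrete, here is a minimal sketch of one way to split a fetchlist across hosts in proportion to their measured speed, so that each host would need roughly the same time to finish. Everything in it is an assumption: the class name SpeedAwareQuota, the use of pages per second as the speed measure, and a plain in-memory map standing in for the host DB from NUTCH-628.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; it illustrates the proportional split only. */
final class SpeedAwareQuota {
    private SpeedAwareQuota() {}

    /**
     * Split totalUrls across hosts in proportion to each host's fetch rate.
     * With quota = totalUrls * rate / sumOfRates, the expected time per host,
     * quota / rate, is the same for every host.
     */
    static Map<String, Integer> quotas(Map<String, Double> pagesPerSecond, int totalUrls) {
        double sum = 0.0;
        for (double rate : pagesPerSecond.values()) {
            sum += Math.max(0.0, rate);   // ignore negative (invalid) rates
        }
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : pagesPerSecond.entrySet()) {
            double rate = Math.max(0.0, e.getValue());
            // Round down, so the quotas never add up to more than totalUrls.
            int share = sum > 0.0 ? (int) Math.floor(totalUrls * rate / sum) : 0;
            out.put(e.getKey(), share);
        }
        return out;
    }
}
```

For example, with a 100-URL fetchlist, one host measured at 4 pages per second and another at 1 page per second, the split is 80 and 20 URLs, and both hosts would be expected to finish in about 20 seconds. Hosts with no speed data yet are not handled here; a real generator would need some default rate for them.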





[jira] Updated: (NUTCH-629) Detect slow and timeout servers and drop their URLs

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-629:
-----------------------------------

    Attachment: NUTCH-629.patch




[jira] Assigned: (NUTCH-629) Detect slow and timeout servers and drop their URLs

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned NUTCH-629:
--------------------------------------

    Assignee: Otis Gospodnetic

