Posted to dev@nutch.apache.org by Louis CHAN <lo...@gmail.com> on 2010/08/04 11:40:33 UTC

Re: [jira] Commented: (NUTCH-629) Detect slow and timeout servers and drop their URLs

Thank you for replying

To describe patch 629: it purges a host if its download speed is too low
(based on a speed limit, a minimum number of pages fetched and the number of
pages remaining) or if there are too many errors (based on a percentage and
the number of pages fetched, successfully or not).

I think that patch 769 is less precise about the errors that occur.

Example:
fetcher.max.exceptions.per.queue = 35
If a given host has 40 dead pages (404) among its 400 pages, the host would
be purged even though only 10% of its pages are dead.
So we would increase fetcher.max.exceptions.per.queue.
However, in the case of an unknown host, we would then lose a lot of time ...


I think it would be better either to change fetcher.max.exceptions.per.queue
into a percentage, or to keep it absolute and require that the allowed number
of errors be reached in a row.
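
To make the difference concrete, here is a minimal sketch in Java of the three
policies. The class and method names are hypothetical and are not part of
Nutch or of any of the patches; the 25% ratio and the 20-attempt minimum are
arbitrary values chosen only for illustration, and purging once the count
reaches the limit is an assumption.

```java
// Hypothetical sketch: these names are not part of Nutch or of any patch.
final class QueueErrorPolicy {

    // Absolute limit, in the spirit of fetcher.max.exceptions.per.queue.
    // Assumption: the host is purged once the error count reaches the limit.
    static boolean exceedsAbsolute(int errors, int maxErrors) {
        return errors >= maxErrors;
    }

    // Percentage limit: purge only after a minimum number of attempts,
    // and only if the share of failed attempts is above maxErrorRatio.
    static boolean exceedsPercentage(int errors, int attempts,
                                     int minAttempts, double maxErrorRatio) {
        if (attempts < minAttempts) {
            return false;
        }
        return (double) errors / attempts > maxErrorRatio;
    }

    // Consecutive limit: the counter is reset by every successful fetch,
    // so only an unbroken run of maxInARow errors purges the host.
    static boolean exceedsConsecutive(boolean[] fetchSucceeded, int maxInARow) {
        int run = 0;
        for (boolean ok : fetchSucceeded) {
            run = ok ? 0 : run + 1;
            if (run >= maxInARow) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Host A: 400 pages, every 10th one is dead (40 errors, 10%).
        boolean[] hostA = new boolean[400];
        for (int i = 0; i < hostA.length; i++) {
            hostA[i] = (i % 10 != 0);
        }
        // Host B: unknown host, all of the first 35 attempts fail.
        boolean[] hostB = new boolean[35]; // every entry defaults to false

        System.out.println("A absolute:    " + exceedsAbsolute(40, 35));
        System.out.println("A percentage:  " + exceedsPercentage(40, 400, 20, 0.25));
        System.out.println("A consecutive: " + exceedsConsecutive(hostA, 35));
        System.out.println("B absolute:    " + exceedsAbsolute(35, 35));
        System.out.println("B percentage:  " + exceedsPercentage(35, 35, 20, 0.25));
        System.out.println("B consecutive: " + exceedsConsecutive(hostB, 35));
    }
}
```

With these values, only the absolute limit purges the host that has 10% of
dead pages, while all three policies purge the unknown host after 35 failed
attempts.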

Your patch 770 is quite good

Thanks
Louis



2010/7/29 Julien Nioche (JIRA) <ji...@apache.org>

>
>    [
> https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893684#action_12893684]
>
> Julien Nioche commented on NUTCH-629:
> -------------------------------------
>
> The 2 features below have been added to 1.1 and provide something
> comparable
>
> https://issues.apache.org/jira/browse/NUTCH-769 : Fetcher to skip queues
> for URLS getting repeated exceptions
> https://issues.apache.org/jira/browse/NUTCH-770 : Timebomb for Fetcher
>
>
> > Detect slow and timeout servers and drop their URLs
> > ---------------------------------------------------
> >
> >                 Key: NUTCH-629
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-629
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: fetcher
> >            Reporter: Otis Gospodnetic
> >            Assignee: Otis Gospodnetic
> >         Attachments: NUTCH-629.patch
> >
> >
> > Fetch jobs will finish faster if we find a way to prevent servers that
> are either slow or time out from slowing down the whole process.
> > I'll attach a patch that counts per-server exceptions and timeouts and
> tracks download speed per server.
> > Queues/servers that exceed timeout or download thresholds are marked as
> "tooManyErrors" or "tooSlow".  Once they get marked as such, all of their
> subsequent URLs get dropped (i.e. they do not get fetched) and marked GONE.
> > At the end of the fetch task, stats for each server processed are
> printed.
> > Also, I believe the per-host/domain/TLD/etc. DB from NUTCH-628 would be
> the right place to add server data collected by this patch.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>