Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2012/03/02 12:45:33 UTC

Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

Hi Guys,

Following some comments on the user list, I recently started digging into
HTTP redirects and then stumbled across NUTCH-1042. Although redirects and
crawl delays are separate issues, I think they are certainly linked. What is
interesting is that users usually don't consider them to be interlinked, and
therefore struggle to debug how and why either the redirected pages or the
pages subject to a crawl delay are not being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which got me thinking about how we maintain the AdaptiveFetchSchedule for
custom refetching. That in turn got me thinking about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as it is becoming a bit of a pain for some
users.
- Can someone shed some light on what happened to the Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)
- For those of you managing/running/maintaining your own (and possibly
clients') web servers, what are your perceptions of maintaining your own
AdaptiveCrawlDelay? Pros and cons (apart from the obvious)?

I can't really think of anything else at the moment!

Thanks

Lewis

-- 
*Lewis*

Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Andrzej,

On Fri, Mar 2, 2012 at 12:37 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Fetcher2 is the current Fetcher. The original Fetcher was temporarily
> renamed OldFetcher and then removed.
>

So it looks like this 'might' be more straightforward to implement than I
originally thought. When I get a bit of time I would like to dive into it.

Thanks

Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 02/03/2012 12:45, Lewis John Mcgibbney wrote:
> Hi Guys,
>
> As there were some comments on the user list, I recently got digging
> with http redirects then stumbled across NUTCH-1042. Although these are
> individual issues e.g. redirects and crawl delays, I think they are
> certainly linked, however what is interesting is that users 'usually'
> don't consider them to be interlinked as such and therefore struggle to
> debug how and why either the redirect or the crawl delay pages are not
> being fetched.
>
> Doing some more digging I found the now rather old and tatty NUTCH-475,
> which obviously got me thinking about how we maintain the
> AdaptiveFetchSchedule for custom refetching. Now I begin to start
> thinking about the following
>
> - Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
> still needs fixed as this is obviously becoming a bit of a pain for some
> users.

Yes.

> - Can someone shine some light on what happened to Fetcher2.java that
> Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)

Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com