Posted to user@nutch.apache.org by Alex McLintock <al...@gmail.com> on 2010/06/29 11:37:54 UTC

http caching proxy?

One thing I don't really understand about running Nutch.

If I am doing several topical crawls - or perhaps crawls constrained
to a number of sites - I will be fetching the same page several times.
It would obviously be polite to not fetch the same page twice.

Now one way I have seen is to use some kind of http caching proxy
between your nutch/hadoop crawl and the outside world. But that kind
of defeats the point of using Nutch if the proxy is all on one big
box.
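
For what it's worth, pointing Nutch at such a proxy looks straightforward
via the http.proxy.* properties - a minimal nutch-site.xml sketch, with
placeholder host/port values for something like a local Squid cache:

  <!-- nutch-site.xml: route HTTP fetches through a caching proxy -->
  <property>
    <name>http.proxy.host</name>
    <value>cache.example.org</value>   <!-- placeholder hostname -->
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>3128</value>                <!-- placeholder port (Squid default) -->
  </property>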

Does Nutch do anything like that itself? As far as I can see it only
really stores the processed documents - not the originally fetched
ones. Each new crawl effectively starts from scratch, ignoring whether
or not any pages were fetched before.



PS I found this Jira issue to refetch only new pages. Is this
available in the release?

https://issues.apache.org/jira/browse/NUTCH-49


Alex

Re: http caching proxy?

Posted by reinhard schwab <re...@aon.at>.
Andrzej Bialecki wrote:
> On 2010-06-29 11:37, Alex McLintock wrote:
>   
>> One thing I don't really understand about running Nutch.
>>
>> If I am doing several topical crawls - or perhaps crawls constrained
>> to a number of sites - I will be fetching the same page several times.
>>     
>
> Do you mean crawls that use disjoint CrawlDbs? Then yes, there is
> no mechanism in Nutch to prevent this.
>
>   
>> It would obviously be polite to not fetch the same page twice.
>>     
>
> If you use a single CrawlDb the same page won't appear on fetchlists
> multiple times, because the first time that you run CrawlDb update it
> already records that it was fetched.
>   
I reported a bug some time ago:
https://issues.apache.org/jira/browse/NUTCH-774
The provided patch has not been integrated into the Nutch 1.1 release.
It can happen that the retry interval is set to 0, so the same page is
fetched again and again.
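
If you suspect a crawldb is affected, a rough way to check is to dump it
and grep for entries whose interval has collapsed to 0 - paths below are
placeholders, and the "Retry interval" label assumes the usual 1.x
CrawlDatum dump format:

  bin/nutch readdb crawl/crawldb -dump /tmp/crawldb-dump
  grep -B 2 "Retry interval: 0 seconds" /tmp/crawldb-dump/part-*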

>   
>> Now one way I have seen is to use some kind of http caching proxy
>> between your nutch/hadoop crawl and the outside world. But that kind
>> of defeats the point of using Nutch if the proxy is all on one big
>> box.
>>
>> Does Nutch do anything like that itself? As far as I can see it only
>> really stores the processed documents - not the originally fetched
>> ones. Each new crawl effectively starts from scratch, ignoring whether
>> or not any pages were fetched before.
>>     
>
> By default Nutch stores everything in segments - both raw pages, parsed
> text, metadata, outlinks, etc. Page status is maintained in crawldb. If
> you use the same crawldb to generate/fetch/parse/update then CrawlDb is
> the place that remembers what pages have been fetched and which ones to
> schedule for re-fetching.
>
>   
>> PS I found this Jira issue to refetch only new pages. Is this
>> available in the release?
>>
>> https://issues.apache.org/jira/browse/NUTCH-49
>>     
>
> Yes. See AdaptiveFetchSchedule class for details.
>
>   


Re: http caching proxy?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-06-29 11:37, Alex McLintock wrote:
> One thing I don't really understand about running Nutch.
> 
> If I am doing several topical crawls - or perhaps crawls constrained
> to a number of sites - I will be fetching the same page several times.

Do you mean crawls that use disjoint CrawlDbs? Then yes, there is
no mechanism in Nutch to prevent this.
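
If you already have several disjoint CrawlDbs, one workaround is to merge
them, so that later generate/update runs see every previously fetched
URL - a sketch using the mergedb tool, with placeholder directory names:

  bin/nutch mergedb crawl/merged-crawldb \
      crawl-topicA/crawldb crawl-topicB/crawldb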

> It would obviously be polite to not fetch the same page twice.

If you use a single CrawlDb the same page won't appear on fetchlists
multiple times, because the first time that you run CrawlDb update it
already records that it was fetched.
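
A minimal sketch of that cycle against one shared CrawlDb (paths and
-topN are placeholders):

  bin/nutch inject crawl/crawldb urls/
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT     # records what was fetched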

> 
> Now one way I have seen is to use some kind of http caching proxy
> between your nutch/hadoop crawl and the outside world. But that kind
> of defeats the point of using Nutch if the proxy is all on one big
> box.
> 
> Does Nutch do anything like that itself? As far as I can see it only
> really stores the processed documents - not the originally fetched
> ones. Each new crawl effectively starts from scratch, ignoring whether
> or not any pages were fetched before.

By default Nutch stores everything in segments - both raw pages, parsed
text, metadata, outlinks, etc. Page status is maintained in crawldb. If
you use the same crawldb to generate/fetch/parse/update then CrawlDb is
the place that remembers what pages have been fetched and which ones to
schedule for re-fetching.
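
You can see this for yourself with the segment and crawldb readers - a
sketch assuming the 1.x command-line options, with placeholder paths and
an invented segment name:

  bin/nutch readseg -list -dir crawl/segments       # per-segment counts
  bin/nutch readseg -dump crawl/segments/20100629123456 /tmp/seg-dump
  bin/nutch readdb crawl/crawldb -stats             # overall page status
  bin/nutch readdb crawl/crawldb -url http://example.com/some/page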

> PS I found this Jira issue to refetch only new pages. Is this
> available in the release?
> 
> https://issues.apache.org/jira/browse/NUTCH-49

Yes. See AdaptiveFetchSchedule class for details.
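
A rough nutch-site.xml sketch to enable it - property names follow
nutch-default.xml, and the interval values are illustrative only:

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>  <!-- 30 days, in seconds -->
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value>    <!-- re-fetch at most daily (illustrative) -->
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>7776000</value>  <!-- back off to at most 90 days (illustrative) -->
  </property>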

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com