Posted to user@nutch.apache.org by dan sutton <da...@gmail.com> on 2012/01/30 14:47:20 UTC

Re-crawling and multiple fetchers

Hi,

Being new to Nutch, I'm unsure how it deals with fetching more
URLs for a site that is currently being fetched.

E.g. if we inject -> generate -> fetch, and whilst fetching we want
to add more URLs for potentially the same sites currently being
fetched, what's in place to ensure that we continue to adhere to the
politeness policy? As I understand it, with the initial fetch all URLs
for the same site are partitioned into the same segment, though I'm
unsure what would maintain this for the second.

Perhaps it's as simple as only allowing one fetch process per cluster
at a time, and then limiting the number of URLs per host to ensure the
second can start in a timely manner?

Thanks for your help,
Dan

Re: Re-crawling and multiple fetchers

Posted by Markus Jelsma <ma...@openindex.io>.
Ah, I see. You do not want to use generate twice in one cycle without doing an
update in between. You can use the freegen tool to generate segments directly
from seed files.

You must maintain a clean cycle of generate, fetch, update. There's an option
to rebuild the DB after generate, but it's another costly job and I would not
encourage using it.
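
For reference, a bare-bones driver for one such clean cycle, modelled on the
old Crawl class, could look like the sketch below. The paths, topN and thread
count are made-up examples, and the signatures are the Nutch 1.x-era ones,
which vary a little between releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class CleanCycle {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path crawlDb = new Path("crawl/crawldb");    // example layout
    Path segments = new Path("crawl/segments");

    // Inject seeds once; alternatively, "bin/nutch freegen urls crawl/segments"
    // builds a segment straight from seed files without touching the crawldb.
    new Injector(conf).inject(crawlDb, new Path("urls"));

    // One clean cycle: generate -> fetch -> parse -> update, in that order.
    Path[] segs = new Generator(conf).generate(
        crawlDb, segments, -1, 50000L, System.currentTimeMillis());
    if (segs == null) return;                    // nothing left to fetch
    new Fetcher(conf).fetch(segs[0], 10);        // 10 fetcher threads
    new ParseSegment(conf).parse(segs[0]);
    new CrawlDb(conf).update(crawlDb, segs, true, true);
  }
}

The point is the ordering: the update step is what lets the next generate see
what has already been fetched, which is why running generate twice without it
causes trouble.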

On Tuesday 31 January 2012 09:43:46 dan sutton wrote:
> Hi Markus,
> 
> I was thinking: what happens if we launch one fetch job fetching URLs
> from HostA (amongst others), then later we run another, separate,
> fetch job for URLs from HostA?
> 
> i.e.
> 
> - add URLs including HostA
> - inject -> generate -> fetch
> - add more URLs including HostA
> - inject -> generate -> fetch
> 
> Is there anything in place to enforce the politeness policy in this case?
> Do we wait for the first fetch to finish, and ensure this happens in a
> timely manner (e.g. by generating only the topN URLs)?
> 
> Many thanks for your advice,
> Dan
> 

-- 
Markus Jelsma - CTO - Openindex

Re: Re-crawling and multiple fetchers

Posted by dan sutton <da...@gmail.com>.
Hi Markus,

I was thinking: what happens if we launch one fetch job fetching URLs
from HostA (amongst others), then later we run another, separate,
fetch job for URLs from HostA?

i.e.

- add URLs including HostA
- inject -> generate -> fetch
- add more URLs including HostA
- inject -> generate -> fetch

Is there anything in place to enforce the politeness policy in this case?
Do we wait for the first fetch to finish, and ensure this happens in a
timely manner (e.g. by generating only the topN URLs)?

Many thanks for your advice,
Dan


On Tue, Jan 31, 2012 at 7:33 AM, Markus Jelsma
<ma...@openindex.io> wrote:
> This is during fetching? Just create a new FetchItem and add it to the queue:
>
> FetchItem fit = FetchItem.create(new Text("http://url"),
>     new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
> fetchQueues.addFetchItem(fit);

Re: Re-crawling and multiple fetchers

Posted by Markus Jelsma <ma...@openindex.io>.
This is during fetching? Just create a new FetchItem and add it to the queue:

FetchItem fit = FetchItem.create(new Text("http://url"),
    new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
fetchQueues.addFetchItem(fit);
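
For context, FetchItem and the fetch queues are inner classes of the Fetcher,
so this is code that lives inside the fetcher itself (e.g. in a FetcherThread).
A rough sketch with an illustrative URL, assuming fetchQueues, queueMode and
interval are in scope as they are there:

// Sketch only: assumes Fetcher.FetcherThread scope (Nutch 1.x), plus
// org.apache.hadoop.io.Text and org.apache.nutch.crawl.CrawlDatum.
// The URL below is an illustrative example.
Text url = new Text("http://example.com/extra-page");
CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);

// create() derives the queue ID from queueMode (byHost, byDomain or
// byIP) and returns null for URLs it cannot parse.
FetchItem fit = FetchItem.create(url, datum, queueMode);
if (fit != null) {
  fetchQueues.addFetchItem(fit);
}

Because the queue ID comes from the same queueMode, the new URL joins the
per-host queue that is already being drained for that host, and so waits
behind the same politeness delay as the URLs generated into the segment.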

> Hi,
> 
> Being new to Nutch, I'm unsure how it deals with fetching more
> URLs for a site that is currently being fetched.
> 
> E.g. if we inject -> generate -> fetch, and whilst fetching we want
> to add more URLs for potentially the same sites currently being
> fetched, what's in place to ensure that we continue to adhere to the
> politeness policy? As I understand it, with the initial fetch all URLs
> for the same site are partitioned into the same segment, though I'm
> unsure what would maintain this for the second.
> 
> Perhaps it's as simple as only allowing one fetch process per cluster
> at a time, and then limiting the number of URLs per host to ensure the
> second can start in a timely manner?
> 
> Thanks for your help,
> Dan