Posted to user@nutch.apache.org by al...@aim.com on 2012/03/02 06:19:34 UTC
different fetch interval for each depth urls
Hello,
I need different fetch intervals for the initial seed URLs and for the URLs extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it does not seem to solve this issue.
Thanks in advance.
Alex.
Re: different fetch interval for each depth urls
Posted by Markus Jelsma <ma...@openindex.io>.
On Fri, 2 Mar 2012 14:32:48 -0500 (EST), alxsss@aim.com wrote:
> I need to run this as a cron job, so I cannot make changes manually.
> My problem is indexing newspaper sites: I want to fetch only the new
> links that are added every day, not the ones already fetched.
>
I see. Trunk can generate records restricted by status:
generate -Dgenerate.restrict.status=<status>
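As a rough illustration of how the command above might be used from a cron script, here is a minimal sketch. The paths (bin/nutch, crawl/crawldb, crawl/segments) and the status value db_unfetched are assumptions for the example, not taken from this thread; check them against your own setup. The script only prints the command (a dry run) so it can be inspected before being scheduled.

```shell
#!/bin/sh
# Sketch only: NUTCH, CRAWLDB and SEGMENTS are assumed paths, and
# db_unfetched is an assumed status value for illustration.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

# Print a generate command restricted to a single CrawlDb status.
# The -D property is placed before the positional arguments.
generate_cmd() {
  status="$1"
  printf '%s generate -Dgenerate.restrict.status=%s %s %s\n' \
    "$NUTCH" "$status" "$CRAWLDB" "$SEGMENTS"
}

# Dry run: show the command instead of executing it.
generate_cmd db_unfetched
```

To actually run the step from cron, the printed line could be executed instead of echoed, for example with eval "$(generate_cmd db_unfetched)".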
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
Re: different fetch interval for each depth urls
Posted by al...@aim.com.
I need to run this as a cron job, so I cannot make changes manually.
My problem is indexing newspaper sites: I want to fetch only the new links that are added every day, not the ones that have already been fetched.
Thanks.
Alex.
Re: different fetch interval for each depth urls
Posted by Markus Jelsma <ma...@openindex.io>.
Well, you could set a new default fetch interval in your configuration
after the first crawl cycle, but the depth information is lost if you
continue crawling, so there is no real solution.
What problem are you trying to solve anyway?
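For illustration only, changing the default interval between cycles might look like the sketch below. It assumes the db.fetch.interval.default property (in seconds) is read when records are first added to the CrawlDb, and that it can be overridden per command with -D; the interval values, the paths, and the SEGMENT placeholder are hypothetical. As noted above, this does not track depth: it only changes the default in effect at each step. The script prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: interval values and paths below are assumptions.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"

# Print a Nutch command with db.fetch.interval.default (seconds) overridden.
# $1 = interval in seconds, $2 = Nutch command, remaining = its arguments.
with_interval() {
  interval="$1"; cmd="$2"; shift 2
  printf '%s %s -Ddb.fetch.interval.default=%s %s\n' \
    "$NUTCH" "$cmd" "$interval" "$*"
}

# First cycle: inject the seeds with a one-day default interval.
with_interval 86400 inject "$CRAWLDB" urls
# Later cycles: update the CrawlDb with a 30-day default interval.
# SEGMENT is a placeholder for a real segment directory name.
with_interval 2592000 updatedb "$CRAWLDB" crawl/segments/SEGMENT
```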