Posted to user@nutch.apache.org by Suhail Ahmed <il...@mac.com> on 2005/05/31 21:00:16 UTC

Re: recrawling sites and querying dates

Thanks Charman,

How do I go about changing the 30 days to 1 day since I intend to  
recrawl on a daily basis?

I am also using index-more to do the indexing. Could someone tell me
how to construct a query using the "lastModified" field? Since I am
recrawling on a daily basis, I am hoping to use the current day and
current day - 1 to retrieve the results as a function of date.

Thanks for the help

Suhail


On May 30, 2005, at 8:44 PM, Chirag Chaman wrote:

> Suhail,
>
> The default nutch crawl process already does this. It will refetch pages
> every 30 days.
> Look at the nutch Wiki and documentation. To recrawl the links, specify the
> link depth.
>
> CC-
>
> --------------------------------------------
> Filangy, Inc.
> Interested in Improving Search? Join our Team!
> http://filangy.com/jointheteam.jsp
>
>
> -----Original Message-----
> From: Suhail Ahmed [mailto:ilyanov@mac.com]
> Sent: Monday, May 30, 2005 12:44 PM
> To: nutch-user@incubator.apache.org
> Subject: recrawling sites
>
> Hi,
>
> How do I go about recrawling websites? Essentially I want to run the
> following tasks repeatedly:
>
> [one off task] inject the database with a url list
>
> 1. create a segment with the initial list
> 2. fetch the segment
> 3. update the database
> 4. create a new segment with the outlinks from [2]
> 5. fetch the segment created in [4].
>
> I basically want to repeat steps 2 through 5. How would I do this?
>
> Thanks for the help
>
> Suhail
>
>
>

RE: recrawling sites and querying dates

Posted by Chirag Chaman <de...@filangy.com>.

I like the new first+last fused name you have going for me...

Basically, to fetch again tomorrow, you need to set each page's next fetch
date to the following day. This can be achieved by changing the fetch
interval, which currently defaults to 30 days.

From nutch-default.xml (copy this property to nutch-site.xml and make the
change there; leave the default file as is, since nutch-site.xml overrides
nutch-default.xml):
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property> 
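
For a daily recrawl, the override in nutch-site.xml might look like the
sketch below. The value of 1 follows from the one-day interval asked about
above. The <nutch-conf> root element is an assumption, so use whatever root
element your own nutch-default.xml has.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides; nutch-default.xml stays untouched -->
<!-- Root element name is an assumption; match the one in your nutch-default.xml -->
<nutch-conf>
  <property>
    <name>db.default.fetch.interval</name>
    <!-- 1 day between re-fetches instead of the default 30 -->
    <value>1</value>
    <description>The default number of days between re-fetches of a page.
    </description>
  </property>
</nutch-conf>
```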

