Posted to user@nutch.apache.org by al...@aim.com on 2014/05/21 23:29:34 UTC

Re: crawl every 24 hours

Hi,

Another way of doing this is to increase

db.fetch.interval.default

to x years and re-inject the original seeds each time. That way you will fetch only new pages during those x years: the fetch time of injected URLs is set to the current time (I believe; double-check this first), while pages that have already been fetched will only be picked up again after x years.
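
As a rough sketch of this approach (the two-year value, the urls/ directory
and the testdb crawl id are illustrative, borrowed from the setup quoted
below), nutch-site.xml would carry something like:

<property>
  <name>db.fetch.interval.default</name>
  <value>63072000</value>
  <description>Roughly 2 years in seconds, so pages that have already
  been fetched are effectively frozen.
  </description>
</property>

and before each daily cycle you would re-inject the original seeds, e.g.
with Nutch 2.x:

bin/nutch inject urls/ -crawlId testdb

(in Nutch 1.x the form is bin/nutch inject crawl/crawldb urls/). If
injection does reset the seeds' fetch time as described above, only the
seeds and any newly discovered links would be fetched in the next round.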

HTH.
Alex 

-----Original Message-----
From: Julien Nioche <li...@gmail.com>
To: user <us...@nutch.apache.org>; Ali rahmani <al...@yahoo.com>
Sent: Wed, May 21, 2014 7:14 am
Subject: Re: Re-crawl every 24 hours


<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page
(30 days).
  </description>
</property>

means that a page which has already been fetched will be re-fetched after
30 minutes, not 30 days: the value is in seconds, so 1800 means 30 minutes.
This is what you want for the seeds, but it also applies to the subpages
you've already discovered in previous rounds.

What you could do is set a custom fetch interval for the seeds only (see
http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
nutch.fetchInterval) and use a larger value for db.fetch.interval.default.
This way the seeds would be revisited frequently but the subpages would
not. Note that this works only if the links to the pages you want to
discover are directly in the seed files. If they are at a deeper level,
they'd be discovered only when the page that mentions them is re-fetched
(i.e. after db.fetch.interval.default).
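
As a sketch of this setup (example.com and both interval values are
placeholders): the seed file carries a per-URL metadata entry, separated
from the URL by a tab:

http://www.example.com/	nutch.fetchInterval=86400

while nutch-site.xml keeps a much larger default for everything else:

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>30 days in seconds; applies to discovered subpages.
  </description>
</property>

With these values the seeds would be re-fetched daily and the subpages
only every 30 days.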

HTH

Julien


On 21 May 2014 11:22, Ali rahmani <al...@yahoo.com> wrote:

> Dear Sir,
> I am customizing Nutch 2.2 to crawl my seed list, which contains about 30
> URLs. I need to crawl the mentioned URLs every 24 hours and JUST fetch
> newly added links. I added the following configuration to the
> nutch-site.xml file and used the following command:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1800</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> <property>
>   <name>db.update.purge.404</name>
>   <value>true</value>
>   <description>If true, updatedb will purge records with status DB_GONE
>   from the CrawlDB.
>   </description>
> </property>
>
>
> ./crawl urls/ testdb http://localhost:8983/solr 2
>
>
> but whenever I run the above command, Nutch goes deeper and deeper.
> Would you please tell me where the problem is?
> Regards,




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble