Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/11/02 07:05:36 UTC
recrawl sites with a scheduled crawling
Hi, I want to recrawl my sites every hour, and I wrote a script for this. I
edited some properties in nutch-site.xml, but my recrawler fetches URLs only
three times and then stops fetching. This means that my Nutch index is no
longer updated after 3 hours. These are my changes in nutch-site.xml:
<property>
<name>db.fetch.interval.default</name>
<value>30</value>
<description>The default number of seconds between re-fetches of a page
(30 seconds with this value).</description>
</property>
<property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule
simply adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>solr.commit.size</name>
<value>10</value>
<description>Defines the number of documents to send to Solr in a single
update batch. Decrease when handling very large documents to prevent Nutch
from running out of memory.</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<value>36000</value>
<description>The maximum number of seconds between re-fetches of a page
(10 hours with this value). After this period every page in the db will be
re-tried, no matter what its status is.</description>
</property>
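
Both interval properties above are given in seconds, so it may help to convert the configured values into readable durations and compare them with the hourly recrawl period. The following is only a small sketch for that arithmetic; the helper name `describe` and the constant `RECRAWL_PERIOD_SECONDS` are made up for illustration and are not part of Nutch.

```python
from datetime import timedelta

# Values copied from the nutch-site.xml changes above; both are in seconds.
intervals = {
    "db.fetch.interval.default": 30,
    "db.fetch.interval.max": 36000,
}

# Assumption: the recrawl script runs once per hour, as stated in the post.
RECRAWL_PERIOD_SECONDS = 3600


def describe(seconds: int) -> str:
    """Return a human-readable duration for a number of seconds."""
    return str(timedelta(seconds=seconds))


for name, seconds in intervals.items():
    # Number of hourly script runs that fit into one interval.
    runs = seconds / RECRAWL_PERIOD_SECONDS
    print(f"{name}: {seconds} s = {describe(seconds)} ({runs:g} hourly runs)")
```

With the values shown, 30 seconds is far shorter than one hourly run, and 36000 seconds corresponds to 10 hours, not 30 or 90 days.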