Posted to user@nutch.apache.org by Po...@acocon.de on 2006/02/14 09:02:30 UTC

intranet crawl update

I want to use Nutch to search one (!) internet site (example: www.mysite.de).

I am quite new to Nutch and have been trying it out. In the tutorial I found
the intranet crawl chapter.

I think that is what I need. I followed the example, everything works fine,
and I can search my site.

My questions:

- How do I update/refresh the index? There is no explanation or example
about the intranet crawl!
- What is the refresh period of the index? And how can I change it?
- What are the meta-tags nutch uses to decide if a page is new or modified?
Or is the entire site recrawled with every update?
- I need to refresh / update the index daily. Is that possible? There are
every day content updates made by users, which I must
- If I deploy the Nutch WAR on an application server, can I update/refresh
the index from a servlet instead of using a shell script? We are using a
Windows box and I don't want to install Cygwin.


Can someone send me a step-by-step explanation or a script that crawls one
site and periodically refreshes/updates its index?

Is there a German speaker out there who can guide me? My English is not as
good as it should be, as you can see.




Re: intranet crawl update

Posted by Thomas Delnoij <di...@gmail.com>.
I will try to answer your questions. If I am wrong, I am sure one of the
more experienced developers can correct me ...:)

> - How do I update/refresh the index? There is no explanation or example
> about the intranet crawl!


The main index (in crawldir/index) is updated by the CrawlTool after every
cycle.

> - What is the refresh period of the index? And how can I change it?


The refresh period of the index (if you're using the CrawlTool; otherwise it
depends on how often you merge your indexes by hand) is actually controlled
by the db.default.fetch.interval property, the default number of days between
re-fetches of a page. By default this property is set to 30 days. If you
would like to change it, copy the property definition from nutch-default.xml
to nutch-site.xml and change the value accordingly.
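
As a rough sketch, the override copied into nutch-site.xml might look like
the fragment below. The value 1 is only an example for a daily re-fetch, and
the fragment is assumed to go inside whatever root element your existing
nutch-site.xml already uses. It has not been checked against a particular
Nutch release.

```xml
<!-- Assumed example value: re-fetch pages after 1 day instead of the default 30 -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
</property>
```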

> - What are the meta-tags nutch uses to decide if a page is new or modified?
> Or is the entire site recrawled with every update?


I don't think Nutch looks at the metatags to decide whether a page should be
refetched or not. The last-modified metatag can be indexed and queried
though; for this to work you need to enable the index-more and query-more
plugins.

> - I need to refresh / update the index daily. Is that possible? There are
> every day content updates made by users, which I must


It is certainly possible. I think it mostly depends on how many pages your
site contains and on your network/hardware setup, i.e. whether you can
fetch/parse/index all of the pages in one day. Of course, you have to set
the db.default.fetch.interval property to 1.

> - If I deploy the Nutch WAR on an application server, can I update/refresh
> the index from a servlet instead of using a shell script? We are using a
> Windows box and I don't want to install Cygwin.


You can run your crawl cycle on a separate box and, once it has finished
merging the indexes, copy the crawl dir to the box running the app server.
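
If a shell script is not an option on the Windows box, the copy step could
be done from plain Java. The following is only a minimal sketch, assuming
Java 7 or later and that both directories are reachable as file system paths
(for example over a network share). The class name CrawlDirCopy and the
command-line arguments are hypothetical and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Nutch: copies a finished crawl dir.
public class CrawlDirCopy {

    /** Recursively copies source into target, overwriting existing files. */
    public static void copyTree(final Path source, final Path target) throws IOException {
        Files.walkFileTree(source, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                // Recreate each directory under the target before its files are copied.
                Files.createDirectories(target.resolve(source.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.copy(file, target.resolve(source.relativize(file)),
                        StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: CrawlDirCopy <sourceCrawlDir> <targetCrawlDir>");
            return;
        }
        copyTree(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Copying into a fresh directory and switching the web application over to it
afterwards is probably safer than overwriting an index that is being
searched, but that depends on your setup.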

> Can someone send me a step-by-step explanation or a script that crawls one
> site and periodically refreshes/updates its index?


This is what the CrawlTool does. Read the Java code of
org.apache.nutch.tools.CrawlTool and you will get a good idea.

HTH - Thomas