Posted to user@nutch.apache.org by Karen Church <ka...@ucd.ie> on 2005/06/08 12:06:24 UTC

Newbie Question - Nutch Functionality

Hi All,
 
I've recently started using Nutch and so far so good :) I have a few questions about the functionality Nutch provides and was wondering if any of you could help.
 
1. I have run a couple of whole-web crawling tests with a few URLs and noticed that the actual pages/files are not saved/downloaded. I understand that the content of the crawled pages is parsed and extracted. Is there any way to configure Nutch so that it downloads the files themselves as well as carrying out the parsing/information extraction?
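For what it's worth, here is roughly what I was picturing: an override in conf/nutch-site.xml. The property name below is just my assumption from skimming a nutch-default.xml, so please correct me if the real knob is different:

    <!-- My guess at a nutch-site.xml override telling the fetcher to
         keep the raw page content in the segment rather than only the
         parsed text. The property name is an assumption on my part. -->
    <property>
      <name>fetcher.store.content</name>
      <value>true</value>
    </property>

If that's right, I'd then expect to be able to pull the raw bytes back out of the segment afterwards, though I haven't found the right tool for that yet.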
 
2. I want to use Nutch to run an initial web crawl and then, after a certain interval, be it days or weeks, re-run the crawl, this time logging (a) new pages added since the last crawl, (b) pages that have been deleted or removed since then, and (c) changes to existing pages in the database. Does Nutch provide such functionality? If so, does anyone have any pointers, or is there existing documentation that would help me get started?
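In case it helps clarify what I mean, here is my rough mental model of the recrawl cycle, pieced together from the whole-web tutorial (the db/segments paths are placeholders, and the diffing step at the end is my own idea, not something I know Nutch does for me):

    # generate a fetchlist of pages that are due, from the web db
    bin/nutch generate db segments
    # grab the newest segment directory (placeholder pattern)
    s=`ls -d segments/2* | tail -1`
    # re-fetch those pages
    bin/nutch fetch $s
    # fold the results back into the web db
    bin/nutch updatedb db $s

What I can't see is where in that cycle I could hook in to detect adds/deletes/changes, short of dumping the db before and after each run and diffing it myself.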
 
3. I also want to crawl only specific content types. For example, can Nutch be configured so that it crawls only PDF files or XML files from a web site instead of everything on the site?
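The closest I've come up with is URL filtering, along these lines in conf/regex-urlfilter.txt (my guess, and the example.com host is obviously a placeholder; my understanding is that the first matching pattern wins):

    # keep only .pdf and .xml under one site
    +^http://www\.example\.com/.*\.(pdf|xml)$
    # reject everything else
    -.

Though I suspect that would starve the crawler of HTML pages to discover links from, and I assume I'd also need the relevant parse plugins (e.g. parse-pdf) enabled in plugin.includes. Is there a better way, e.g. filtering on the Content-Type header instead of the URL?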
 
4. I understand that Nutch uses Lucene for its indexing requirements. Is it possible to crawl pages using Nutch and then implement a separate search strategy using Lucene? Is it relatively straightforward to hook up the two?
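To make the question concrete, here is a minimal sketch of what I have in mind, i.e. opening the index that "bin/nutch index" produced with the plain Lucene API (the index path and the "content"/"url" field names are assumptions on my part about how Nutch lays out its index):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class RawLuceneSearch {
        public static void main(String[] args) throws Exception {
            // open the Lucene index Nutch built (path is a placeholder)
            IndexSearcher searcher = new IndexSearcher("crawl/index");
            // parse a query against the assumed "content" field
            Query q = QueryParser.parse("nutch", "content", new StandardAnalyzer());
            Hits hits = searcher.search(q);
            for (int i = 0; i < hits.length(); i++) {
                // print the stored url of each hit
                System.out.println(hits.doc(i).get("url"));
            }
            searcher.close();
        }
    }

I realise Nutch probably analyses text with its own analyzer rather than StandardAnalyzer, so I'd expect results to differ a bit from the Nutch web app; is that the main gotcha, or are there others?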
 
Any help with any of the questions above would be much appreciated.
 
Thanks in advance,
 
Karen