
Posted to agent@nutch.apache.org by "Ricardo J. Méndez" <me...@gmail.com> on 2007/02/21 16:11:37 UTC

Customizing crawling questions

Hi,

I've got a few questions about customizing the crawling process.  I
tried checking out the Wiki, but many of the pages linked from
"Becoming a Nutch Developer" are still unwritten, so any pointers you
can provide would be very welcome.


1) Which types of links does Nutch follow? Only <a href="..."> anchors?
If so, I'd like it to also follow some <link /> references from the
page's <head> section.  I know that I can obtain the link reference with
a Parse plugin, but how should I add it to the list of items to be
crawled?
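
To make question 1 concrete, the extraction step I have in mind looks
roughly like the sketch below.  It uses only the JDK, not the Nutch
plugin API: the names LinkTagExtractor and extractLinkHrefs are
hypothetical, and the regular expressions are only a stand-in for a
real HTML parser.  What I'm missing is how to hand the resulting URLs
back to Nutch so that they get queued for fetching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects the href values of
// <link> tags found in an HTML string.
public class LinkTagExtractor {
    // Matches a whole <link ...> tag, case-insensitively.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches href="..." or href='...' inside a single tag.
    private static final Pattern HREF_ATTR =
        Pattern.compile("\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinkHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            // Look for the href attribute only within the matched <link> tag,
            // so ordinary <a href="..."> anchors are ignored.
            Matcher href = HREF_ATTR.matcher(tag.group());
            if (href.find()) {
                hrefs.add(href.group(2));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<link rel=\"alternate\" type=\"application/rss+xml\""
            + " href=\"http://example.com/feed.xml\" />"
            + "</head><body><a href=\"http://example.com/page\">page</a></body></html>";
        System.out.println(extractLinkHrefs(html));
    }
}
```

Relative hrefs would still need to be resolved against the page URL
before being passed on.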

2) Which type of plugin, or which return value from one, determines
which items go into the database, if any does?  For instance, can I
write a plugin that returns "false" if I don't want the database to
store a PDF or a Word document?  Or to skip a specific page, based on
something found by a Parse plugin?

Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/