Posted to user@nutch.apache.org by Steven Wichers <st...@devnet.com> on 2010/02/24 18:09:01 UTC

Crawling site, but only indexing certain pages

On some of the sites I want to index with nutch, there are only
specific types of pages I would like to be searchable. I need a way to
be able to crawl these sites, but only index pages that match a
certain regular expression.

ex:

www.example.com/browse/ finds links in the form of
www.example.com/items/1234.html and
www.example.com/items/browse_by_xyz.html . I need to be able to index
just the www.example.com/items/1234.html style links while still
crawling the browse_by_xyz.html style links.

From my searching I thought that I could use crawl-urlfilter.txt to
restrict where Nutch crawled, and regex-urlfilter.txt to restrict what
was actually indexed. This did not seem to work, so I was either
misinformed or implemented it incorrectly.

Does Nutch have the capability I am looking for?

Re: Crawling site, but only indexing certain pages

Posted by Magnús Skúlason <ma...@gmail.com>.
Hi,

This is actually very easy: just create an indexing plugin, analyse the URL
format, and return null from the indexing plugin if you don't want to index
it.
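
A rough sketch of the URL check such a plugin could perform is below. It is
plain Java with no Nutch dependencies, so it only illustrates the decision
logic. The class name, the method names and the regular expression are all
made up for this example. In a real plugin, I would expect the same test to
sit inside the filter method of a class implementing Nutch's IndexingFilter
interface, which hands back the document to keep it or null to drop it, but
check the exact signature against your Nutch version.

```java
import java.util.regex.Pattern;

// Illustrative only: stands in for the decision an indexing plugin would make.
final class ItemUrlIndexFilter {

    // Hypothetical pattern: index only numeric item pages such as
    // www.example.com/items/1234.html (scheme optional).
    private static final Pattern INDEXABLE =
        Pattern.compile("^(https?://)?www\\.example\\.com/items/\\d+\\.html$");

    // True when the URL is one of the pages that should be searchable.
    static boolean shouldIndex(String url) {
        return url != null && INDEXABLE.matcher(url).matches();
    }

    // Mirrors the "return null to skip" convention described above:
    // the document is passed through unchanged or replaced by null.
    static <D> D filter(D doc, String url) {
        return shouldIndex(url) ? doc : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/items/1234.html",
            "http://www.example.com/items/browse_by_xyz.html",
            "http://www.example.com/browse/"
        };
        for (String url : urls) {
            String doc = filter("doc for " + url, url);
            System.out.println(url + " : " + (doc == null ? "skipped" : "indexed"));
        }
    }
}
```

The browse pages are still crawled and their links followed, since nothing
here touches the URL filters. Only the indexing step skips them.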

best regards,
Magnus

On Wed, Feb 24, 2010 at 6:09 PM, Steven Wichers <st...@devnet.com> wrote:

> On some of the sites I want to index with nutch, there are only
> specific types of pages I would like to be searchable. I need a way to
> be able to crawl these sites, but only index pages that match a
> certain regular expression.
>
> ex:
>
> www.example.com/browse/ finds links in the form of
> www.example.com/items/1234.html and
> www.example.com/items/browse_by_xyz.html . I need to be able to index
> just the www.example.com/items/1234.html style links while still
> crawling the browse_by_xyz.html style links.
>
> From my searching I thought that I could use crawl-urlfilter.txt to
> restrict where Nutch crawled, and regex-urlfilter.txt to restrict what
> was actually indexed. This did not seem to work, so I was either
> misinformed or implemented it incorrectly.
>
> Does Nutch have the capability I am looking for?
>