Posted to user@nutch.apache.org by Raymond Balmès <ra...@gmail.com> on 2009/02/28 15:53:03 UTC

newbie: filtering with regex

Hi all,

I'm new to Nutch internals. In short, my project is the following:

index web pages, but only those that match a specific regex (supplied by
me). The regex extracts specific attributes, which will be used later for
efficient search.

I envisioned the following changes. Do you think this goes in the right
direction, or is there a smarter way? The plug-in technique did not
seem to fit.

1. modify the outlink extractor in two ways

   - return an array of matches of my regex
   - return outlinks only if my regex matches

2. modify the indexer to use the regex match attribute

   - do not index pages with no matches


3. modify the search engine to use the matches attribute
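
To make the idea concrete, here is a minimal sketch of the matching logic
from steps 1 and 2, written with plain java.util.regex only. It does not use
any Nutch API; the class name RegexPageFilter, its methods, and the sample
pattern are all made up for illustration, and where this would hook into the
parser or indexer is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects regex matches from page
// text and decides whether a page is worth indexing.
final class RegexPageFilter {
    private final Pattern pattern;

    RegexPageFilter(String regex) {
        // Compile once; the same Pattern is reused for every page.
        this.pattern = Pattern.compile(regex);
    }

    // Returns every match of the regex in the page text, in order of
    // appearance (step 1: "return an array of matches of my regex").
    List<String> findMatches(String pageText) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(pageText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // True if the page contains at least one match
    // (step 2: "do not index pages with no matches").
    boolean shouldIndex(String pageText) {
        return pattern.matcher(pageText).find();
    }

    public static void main(String[] args) {
        // Sample pattern (an assumption): two capital letters, a dash, four digits.
        RegexPageFilter filter = new RegexPageFilter("[A-Z]{2}-\\d{4}");
        String page = "Parts AB-1234 and CD-5678 are in stock.";
        System.out.println(filter.findMatches(page));           // [AB-1234, CD-5678]
        System.out.println(filter.shouldIndex("nothing here")); // false
    }
}
```

The list returned by findMatches is what I would store as the "matches"
attribute for step 3.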


Thanks for your answers!