Posted to user@nutch.apache.org by Raymond Balmès <ra...@gmail.com> on 2009/02/28 15:53:03 UTC
newbie: filtering with regex
Hi all,
I'm new to Nutch internals. My project, in short, is the following: index
web pages, but only those that contain a match for a specific regex (supplied
by me). The regex extracts specific attributes which will be used later for
efficient search.
I envisioned the following changes. Do you think this goes in the right
direction, or is there a more intelligent way? The plug-in technique did not
seem to fit.
1. modify the outlink extractor in two ways
- return an array of matches of my regex
- return outlinks only if my regex matches
2. modify the indexer to use the regex match attribute
- do not index pages with no matches
3. modify the search engine to use the matches attribute
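To make the idea concrete, here is a minimal sketch of the matching step in plain Java. It uses only java.util.regex and no Nutch API: the class name RegexPageFilter, its two methods, and the sample pattern are all made up for illustration, so how this would hook into the outlink extractor or the indexer is still the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: it only illustrates the
// "extract matches, keep the page only if there is a match" logic.
final class RegexPageFilter {

    private final Pattern pattern;

    RegexPageFilter(String regex) {
        // Compile once and reuse for every page.
        this.pattern = Pattern.compile(regex);
    }

    /** Returns every match of the pattern in the page text, in order of appearance. */
    List<String> extractMatches(String pageText) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(pageText);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    /** A page is kept (indexed, outlinks returned) only if it has at least one match. */
    boolean shouldIndex(String pageText) {
        return pattern.matcher(pageText).find();
    }

    public static void main(String[] args) {
        // The pattern below is an invented example (two capitals, dash, four digits).
        RegexPageFilter filter = new RegexPageFilter("[A-Z]{2}-\\d{4}");
        String page = "Parts in stock: AB-1234 and CD-5678.";
        System.out.println(filter.extractMatches(page));
        System.out.println(filter.shouldIndex("no part numbers here"));
    }
}
```

The list returned by extractMatches would be what gets stored as the "matches" attribute in steps 1 and 2, and shouldIndex is the test for dropping a page and its outlinks.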
Thanks for your answers!