Posted to user@nutch.apache.org by jay jiang <jj...@bbn.com> on 2006/02/16 20:07:44 UTC

not indexing path names

I am crawling an intranet.  Apparently Nutch also indexes the URL path 
names (as documents) as it crawls.  So if a query word appears in a 
path name, the path URL itself is returned as a result.  Since such 
results are typically of no value to users, I want to filter 
them out. 

I think we have to crawl them, since we need them to reach the actual 
document URLs underneath each path.  But we do not want to index them.  
Is there any way to configure Nutch not to index path names during the 
crawling step?  If not, can we configure it in the search step?  I know 
we can always filter them out using getDetails(), but that does not 
seem like a very clean approach.
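
For reference, the kind of post-search filtering I mean could be sketched 
roughly as below.  This is only an illustration and not Nutch API: the 
class PathNameFilter and its methods are hypothetical, the host name is 
made up, and it works on plain URL strings however they are obtained from 
the hit details.  It also assumes that a "path name" is any URL whose 
path ends in "/".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: drops result URLs that look
// like bare path names rather than actual documents.
class PathNameFilter {

    // Assumption: a URL is a "path name" when its path ends with "/".
    // The fragment and query string are stripped first so that
    // "http://host/docs/?sort=name" is still treated as a path name.
    static boolean looksLikePathName(String url) {
        String s = url;
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }
        int query = s.indexOf('?');
        if (query >= 0) {
            s = s.substring(0, query);
        }
        return s.endsWith("/");
    }

    // Returns a new list holding only the URLs that do not look like
    // path names, preserving the original result order.
    static List<String> dropPathNames(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!looksLikePathName(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

The drawback is the one mentioned above: the path names are still in the 
index and still count toward the hits, so they have to be thrown away 
after every search instead of being kept out at indexing time.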

Thanks,
--Jay