Posted to user@nutch.apache.org by jay jiang <jj...@bbn.com> on 2006/02/16 20:07:44 UTC
not indexing path names
I am crawling an intranet. Apparently Nutch also indexes the URL path
names as documents while it crawls. So if a query word appears in a
path name, the whole URL path name comes back as a result of its own.
Since such results are typically of no value to users, I want to filter
them out.
I think we have to crawl them, since we need them to reach the actual
document URLs underneath each path, but we do not want to index them.
Is there any way to configure Nutch not to index path names during the
crawling step? If not, can it be configured in the search step? I know
we can always filter them out using getDetails(), but that does not
seem like a very clean way.
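To make the search-side workaround concrete, here is a rough sketch of
the kind of filtering meant. It is plain Java, not Nutch API code: the
class PathUrlFilter and its methods are hypothetical names, and it
assumes that the unwanted path-name results are exactly the URLs whose
path ends in "/". The URL strings would be whatever is read from each
hit's details.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class PathUrlFilter {

    // Assumption: a "path name" result is a URL whose path ends in "/".
    static boolean looksLikePathName(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null && path.endsWith("/");
        } catch (IllegalArgumentException e) {
            // Unparseable URLs are kept rather than silently dropped.
            return false;
        }
    }

    // Returns the URLs that do not look like bare path names, in order.
    static List<String> dropPathNames(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!looksLikePathName(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

The drawback is the one already mentioned: the path-name pages are
still in the index and still count as hits, so result totals and paging
would be off unless the filter runs before those are computed.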
Thanks,
--Jay