Posted to dev@nutch.apache.org by Brian Hill <hi...@yosemite.cc.ca.us> on 2006/05/11 23:15:55 UTC

Preventing overlapped search results.

I'm new to Nutch, but I couldn't find this in the archives or docs and
it has me stumped.

I have two websites that I need to index in Nutch. I am presently
running two separate crawls to index these sites, but a single link is
screwing up my search results. 

I have two flat files in my Nutch directory, "Domain1" and "Domain2".
Each file contains the starting URL for one of the two sites, and the
two crawls generate completely separate database folders, which are in
turn read by two independent Nutch frontend installations in Tomcat.

My problem is with the crawl-urlfilter.txt file. Because this is a local
search, I need to restrict the crawl to my own domains, so the file
contains these lines:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*domain1.edu/
+^http://([a-z0-9]*\.)*domain2.edu/

This would work perfectly EXCEPT that there is a single link on the
domain1.edu site to the homepage of the domain2.edu site. Nutch is
following this link, and as a result the domain1 search results are
bringing up the full domain1.edu AND domain2.edu sites. 
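
The behaviour described above can be reproduced outside Nutch with a
rough Python sketch. This is not Nutch code: the accepts() helper and
the DOMAIN1_ONLY_RULES list are hypothetical, and the sketch assumes the
filter applies rules top to bottom, lets the first matching rule decide,
and rejects a URL that no rule matches. It only shows that a filter file
holding both lines lets the domain2.edu homepage through during the
Domain1 crawl, whereas a file holding only the domain1 line would not.

```python
import re

# The two rules quoted from crawl-urlfilter.txt ('+' accepts, '-' rejects).
SHARED_RULES = [
    r"+^http://([a-z0-9]*\.)*domain1.edu/",
    r"+^http://([a-z0-9]*\.)*domain2.edu/",
]

# Hypothetical variant for the Domain1 crawl only (an assumption, not
# something taken from the original setup).
DOMAIN1_ONLY_RULES = [
    r"+^http://([a-z0-9]*\.)*domain1.edu/",
]


def accepts(rules, url):
    """Return True if the first rule matching `url` is a '+' rule.

    Assumed semantics: rules are tried in order, the first match decides,
    and a URL matching no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    link = "http://www.domain2.edu/"
    print("shared filter accepts link:      ", accepts(SHARED_RULES, link))
    print("domain1-only filter accepts link:", accepts(DOMAIN1_ONLY_RULES, link))
```

Under these assumptions the shared rules accept the cross-site link, so
one outlink from domain1.edu is enough to pull domain2.edu into the
Domain1 crawl. Note also that the dots in "domain1.edu" are unescaped in
the quoted rules, so they match any character, not just a literal dot.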

What's the best way to deal with this problem? When I run the Domain1
Nutch search, I need the results to be limited to the domain1.edu,
subdomain1.domain1.edu, and subdomain2.domain1.edu websites. Likewise,
if I add a reciprocal link to domain2.edu, I need users of THAT search
interface to receive results only relevant to that domain.

PLEASE don't tell me I need two independent Nutch installations! Your
help is appreciated.

Brian Hill