Posted to user@nutch.apache.org by jgelb <jg...@pearsoncmg.com> on 2007/11/09 22:13:04 UTC

crawl on non-standard port, index/search on port 80?

All:

I'm looking for a sanity check before I embark on something.

The web servers I want to index are visible to the public on port 80 via
some load balancing equipment, but don't themselves listen on port 80.
That is, from the public perspective they look like http://foo.bar.com/,
but privately/internally, they're actually http://foo.bar.com:9000/

I need my Nutch server to crawl via the internal/private port, not the
public port 80.  When people get search results, however, I want the
URLs in the results to reference only the public address (foo.bar.com)
and not expose the internal port.

I can crawl on the alternate port just fine, but it's exposed in the
search results.

It looks like BasicIndexingFilter is where the URL is added as a field
in the index.  I'm thinking that I need to create my own variant that
removes the unwanted port info before adding the URL, but I'm afraid
that storing a URL that differs from the crawled one may cause problems
when it's time to reindex things.

An alternative might be to modify the URL before displaying my search
results.  I'd prefer to be able to simply serve the results without
having to change them -- seems like it would scale better that way.
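
Either way, the rewrite itself would be the same small step.  Here is a
rough sketch of it using only java.net.URI, with nothing Nutch-specific
in it.  The class name PortStripper, the method stripInternalPort, and
the INTERNAL_PORT constant are made up for illustration; how this would
be wired into an indexing filter or the results page is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: drops the internal port from a URL.
class PortStripper {

    // Assumption: the private port from the example above.
    static final int INTERNAL_PORT = 9000;

    // Returns the URL without the internal port; any other URL is returned as-is.
    static String stripInternalPort(String url) {
        try {
            URI u = new URI(url);
            if (u.getPort() != INTERNAL_PORT || u.getHost() == null) {
                return url;
            }
            // Rebuild from the raw (still-encoded) parts so escapes are preserved.
            StringBuilder sb = new StringBuilder();
            sb.append(u.getScheme()).append("://");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost());
            if (u.getRawPath() != null) {
                sb.append(u.getRawPath());
            }
            if (u.getRawQuery() != null) {
                sb.append('?').append(u.getRawQuery());
            }
            if (u.getRawFragment() != null) {
                sb.append('#').append(u.getRawFragment());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            // Leave anything unparseable untouched rather than guess.
            return url;
        }
    }

    public static void main(String[] args) {
        // prints http://foo.bar.com/docs/index.html
        System.out.println(stripInternalPort("http://foo.bar.com:9000/docs/index.html"));
    }
}
```

Only the exact internal port is removed, so URLs that legitimately
carry some other port would pass through unchanged.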

Any suggestions?  Did I miss something?

thx.

-- jeff