Posted to user@nutch.apache.org by David Wallace <da...@nzqa.govt.nz> on 2006/02/19 23:40:08 UTC

Storing redirections in segment

Hi all,
I'm running a fairly old build of 0.7, so please accept my apologies if
what I'm describing has been changed in a later release.
 
It seems that if a URL gets redirected during a crawl, then it's the
original URL, not the redirected version, that gets stored in the
segment and indexed.  I'm wondering whether this is the "correct"
behaviour, for a couple of reasons.
(1) If the redirection occurs because a page has moved, then maybe the
redirect page will be removed shortly.  In this case, surely I want the
new URL, not the old URL, in my index?
(2) If I have multiple URLs that all redirect to the same target, and I
store them all in my segment, then when the user comes to do a search,
they'll see multiple results that are all actually the same page,
addressed via different URLs.  This is absolutely NOT what the user will
want to see.
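
To make point (2) concrete, here is a minimal sketch in plain Python (not
Nutch code) of collapsing fetched URLs onto their final redirect target.
The function names, the redirect map and the example.com URLs are all
hypothetical, made up purely for illustration; nothing here reflects how
Nutch itself stores redirects.

```python
from collections import defaultdict


def resolve_final_url(url, redirects, max_hops=10):
    """Follow the (hypothetical) redirect map until a URL has no further
    redirect, a loop is detected, or max_hops redirects have been followed."""
    seen = {url}
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None or target in seen:
            break  # no further redirect, or we would revisit a URL
        seen.add(target)
        url = target
    return url


def group_by_target(fetched_urls, redirects):
    """Group the originally requested URLs by the URL they finally resolve to."""
    groups = defaultdict(list)
    for url in fetched_urls:
        groups[resolve_final_url(url, redirects)].append(url)
    return dict(groups)


# Hypothetical crawl data: three old addresses lead to the same page.
redirects = {
    "http://example.com/old": "http://example.com/new",
    "http://example.com/older": "http://example.com/old",
    "http://example.com/alias": "http://example.com/new",
}
fetched = [
    "http://example.com/old",
    "http://example.com/older",
    "http://example.com/alias",
    "http://example.com/about",
]

groups = group_by_target(fetched, redirects)
for target, sources in groups.items():
    print(target, "<-", sources)
```

Indexing one entry per key of groups, rather than one per fetched URL,
would give a single search result per distinct page.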
 
I'm wondering whether there are:
- good reasons why the behaviour is the way it is;
- hidden consequences that are going to bite me, if I change the
behaviour in my own installation.
 
Advice please?
 
Kind regards,
David.
