Posted to dev@nutch.apache.org by Doug Cook <na...@candiru.com> on 2006/11/29 20:47:17 UTC

Re: Should URL normalization iterate?

Just fixed this (and made some other improvements as well). See:

http://issues.apache.org/jira/browse/NUTCH-410

Hope this is useful; feedback welcome.

-Doug


Neal Richter-3 wrote:
> 
> Doug,
> 
> I think it sounds like a good idea.  It eliminates the need to order the
> rules precisely...
> 
> We don't iterate them in HtDig and it's been on my todo list for a while as
> well.
> 
> I would iterate until no matches, some max iteration number, or the URL is
> obviously junk.
> 
> For the max iteration number I would use the number of rewrite rules you
> have.  So if you have 10 rules, you iterate on all 10 rules 10 times.  That
> will cover the case where your rules 'chain' in a 10 step sequence.  Sure
> it's an edge case to do that, but I can see rule sets where you construct
> 3-step chains (like swapping strings or something).
> 
> Thanks
> 
> Neal
> 
> On 8/30/06, Doug Cook <na...@candiru.com> wrote:
>>
>>
>> Hi,
>>
>> I've run across a few patterns in URLs where applying a normalization puts
>> the URL in a form matching another normalization pattern (or even the same
>> one). But that pattern won't get executed because the patterns are applied
>> only once.
>>
>> Should normalization iterate until no patterns match (with, perhaps, some
>> limit to the number of iterations to prevent loops from pattern
>> mistakes)?
>>
>> It's a minor problem; it doesn't seem to affect too many URLs for things
>> like session ID removal, since finding two session IDs in the same URL is
>> rare (but does happen -- that's how I noticed this). I could imagine it
>> being much more significant, however, if other Nutch users out there are
>> using "broader" normalization patterns.
>>
>> Any philosophical/practical objections? (it's early, I've only had 1 coffee,
>> and I've probably missed something obvious!)
>>
>> I'll file an issue and add it to my queue of things to do if people think
>> it's a good idea.
>>
>> -Doug
>> --
>> View this message in context:
>> http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
>> Sent from the Nutch - Dev forum at Nabble.com.
>>
>>
> 
> 
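
For anyone reading this thread later, here is a minimal sketch of the
iterate-until-stable idea discussed above. It is written in Python rather than
Nutch's Java, and it is not the code attached to NUTCH-410. The rules, the
"sid" parameter name, and the function names are all made up for illustration.
It shows the double-session-ID case: a single pass removes only the first ID,
because the "&" that the first match consumes is the same "&" the second match
would need.

```python
import re

# Hypothetical rules for illustration only; not Nutch's shipped patterns.
RULES = [
    # Remove "&sid=<value>" when followed by "&" or end of string,
    # keeping the trailing "&" (if any) so later parameters stay attached.
    (re.compile(r"&sid=[^&]*(&|$)"), r"\1"),
    # Drop a "?" left dangling at the end of the URL.
    (re.compile(r"\?$"), ""),
]


def apply_once(url, rules=RULES):
    """Apply every rule one time, in order (a single normalization pass)."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def normalize(url, rules=RULES, max_passes=None):
    """Repeat full passes until the URL stops changing or the cap is hit."""
    if max_passes is None:
        # Cap suggested in the thread: as many passes as there are rules.
        max_passes = len(rules)
    for _ in range(max_passes):
        rewritten = apply_once(url, rules)
        if rewritten == url:
            break  # fixed point reached: no rule changed anything
        url = rewritten
    return url


if __name__ == "__main__":
    url = "http://example.com/p?a=1&sid=x&sid=y"
    print(apply_once(url))  # one pass leaves the second session ID in place
    print(normalize(url))   # iterating removes both
```

The cap is left as a parameter because a rule that re-matches its own output
(as the session-ID rule does here) can need more passes than there are rules;
with a single such rule, a cap equal to the rule count would allow one pass
only.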

-- 
View this message in context: http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a7606490
Sent from the Nutch - Dev mailing list archive at Nabble.com.