Posted to dev@nutch.apache.org by Doug Cook <na...@candiru.com> on 2006/08/30 16:21:04 UTC

Should URL normalization iterate?

Hi,

I've run across a few patterns in URLs where applying a normalization puts
the URL in a form matching another normalization pattern (or even the same
one). But that pattern won't get executed because the patterns are applied
only once.

Should normalization iterate until no patterns match (with, perhaps, some
limit to the number of iterations to prevent loops from pattern mistakes)?
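
A minimal sketch of what iterating to a fixed point with a cap could look like. It is written in Python for brevity and is not Nutch's actual normalizer code. The rule, the `sid` parameter name, the function names, and the default limit of 10 are all assumptions made for illustration, and "no patterns match" is approximated here as "a full pass left the URL unchanged".

```python
import re

# Hypothetical rule set: (compiled pattern, replacement) pairs.
# This single rule strips an assumed "sid" session parameter that sits
# between two "&" separators.
RULES = [
    (re.compile(r"&sid=[^&]*&"), "&"),
]


def normalize_once(url, rules=RULES):
    """Apply every rule exactly once, in order (single-pass behaviour)."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


def normalize_iterated(url, rules=RULES, max_iterations=10):
    """Repeat full passes until the URL stops changing or the cap is hit."""
    for _ in range(max_iterations):
        rewritten = normalize_once(url, rules)
        if rewritten == url:
            # A pass that changes nothing means we reached a fixed point.
            break
        url = rewritten
    return url


url = "http://example.com/page?a=1&sid=abc&sid=def&b=2"
single_pass = normalize_once(url)
fixed_point = normalize_iterated(url)
print(single_pass)
print(fixed_point)
```

In this example the first match consumes the `&` that the second session ID would need as its prefix, so a single pass leaves `sid=def` behind; the second pass removes it. The cap only matters when a faulty rule keeps rewriting the URL forever.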

It's a minor problem; it doesn't seem to affect too many URLs for things
like session ID removal, since finding two session IDs in the same URL is
rare (but does happen -- that's how I noticed this). I could imagine it
being much more significant, however, if other Nutch users out there are
using "broader" normalization patterns.

Any philosophical/practical objections? (it's early, I've only had 1 coffee,
and I've probably missed something obvious!) 

I'll file an issue and add it to my queue of things to do if people think
it's a good idea.

-Doug
-- 
View this message in context: http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
Sent from the Nutch - Dev forum at Nabble.com.


Re: Should URL normalization iterate?

Posted by Doug Cook <na...@candiru.com>.
Just fixed this (and made some other improvements as well). See:

http://issues.apache.org/jira/browse/NUTCH-410

Hope this is useful; feedback welcome.

-Doug


Neal Richter-3 wrote:
> 
> Doug,
> 
> I think it sounds like a good idea.  It eliminates the need to order the
> rules precisely...
> 
> We don't iterate them in HtDig and it's been on my todo list for a while
> as well.
> 
> I would iterate until no rules match, a maximum iteration count is
> reached, or the URL is obviously junk.
> 
> For the max iteration number I would use the number of rewrite rules you
> have.  So if you have 10 rules, you iterate on all 10 rules 10 times.
> That will cover the case where your rules 'chain' in a 10 step sequence.
> Sure it's an edge case to do that, but I can see rule sets where you
> construct 3-step chains (like swapping strings or something).
> 
> Thanks
> 
> Neal
> 



Re: Should URL normalization iterate?

Posted by Neal Richter <nr...@gmail.com>.
Doug,

I think it sounds like a good idea.  It eliminates the need to order the
rules precisely...

We don't iterate them in HtDig and it's been on my todo list for a while as
well.

I would iterate until no rules match, a maximum iteration count is reached,
or the URL is obviously junk.

For the max iteration number I would use the number of rewrite rules you
have.  So if you have 10 rules, you iterate on all 10 rules 10 times.  That
will cover the case where your rules 'chain' in a 10 step sequence.  Sure
it's an edge case to do that, but I can see rule sets where you construct
3-step chains (like swapping strings or something).
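
To make the "max iterations = number of rules" idea concrete, here is a rough Python sketch (not HtDig or Nutch code). The `normalize` function and the three path-rewriting rules are hypothetical; the rules are deliberately listed in the worst possible order so that each pass can advance the chain by only one step.

```python
import re


def normalize(url, rules):
    """Run full passes over the rules, capped at one pass per rule."""
    max_passes = len(rules)  # the cap suggested above: N rules, N passes
    for _ in range(max_passes):
        before = url
        for pattern, replacement in rules:
            url = re.sub(pattern, replacement, url)
        if url == before:
            # Nothing changed in this pass, so further passes are pointless.
            break
    return url


# Hypothetical 3-step chain (/a/ -> /b/ -> /c/ -> /d/), listed in reverse
# order so that a single pass only performs the first step.
CHAIN = [
    (r"/c/", "/d/"),
    (r"/b/", "/c/"),
    (r"/a/", "/b/"),
]

one_pass = normalize("http://example.com/a/index.html", CHAIN[2:])
chained = normalize("http://example.com/a/index.html", CHAIN)
print(one_pass)
print(chained)
```

With three rules the loop runs at most three passes, which is exactly enough to walk the three-step chain even in this worst-case ordering. A rule set that never settles still terminates after `len(rules)` passes.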

Thanks

Neal
