Posted to dev@nutch.apache.org by Matt Kangas <ka...@gmail.com> on 2006/03/14 02:33:43 UTC

[proposal] catching session-id urls

Hi nutch-dev,

I know that we have RegexUrlNormalizer already for removing
session-ids from URLs, but lately I've been wondering if there isn't a more
general way to solve this, without relying on pre-built patterns.

I think I have an answer that will work. I haven't seen this approach  
published anywhere, so any failings are entirely my fault. ;) What  
I'm wondering is:
- Does this seem like a good (effective, efficient) algorithm for  
catching session-id URLs?
- If so, where is the best place to implement it within Nutch?

Basic idea: session ids within URLs only cause problems for crawlers  
when they change. This typically occurs when a server-side session  
expires and a new id is issued. So, rather than looking for URL  
argument patterns (as RegexUrlNormalizer does), look for a
value-transition pattern.

Algorithm:

1) Iterate over each page in a fetched segment

2) For each successful fetch, extract:
  - The fetched URL. Call this (u0)
  - All links on the page that refer to the same site/domain. Call  
this set (u1..N)

3) Parse u0 into parameters (p0) as follows:
  - named parameters: add (key,value) to Map
  - positional (path) params: add (position,value) to Map

So for the url "http://foo.bar/spam/eggs?x=true&y=2", pseudocode  
would look like:
  Map p0 = new HashMap();
  p0.put(new Integer(1), "spam");   // positional path param #1
  p0.put(new Integer(2), "eggs");   // positional path param #2
  p0.put("x", "true");              // named query param
  p0.put("y", "2");                 // named query param

4) Parse u1..N into (p1..N) using the same method

5) Compare p0 with p1..N. Look for the following pattern:
  - keys that are present for all p0..N, and
  - values that are identical for all p1..N, and
  - the value in p0 is _different_

If you see this condition, flag the page as "contains session id that  
just changed" and deal with it accordingly. (Delete from crawldb, etc)

So... for anyone who's still reading ;), does this seem like it would  
work for catching session-ids? What corner-cases would trip it up?  
Can you think of cases when it would fall flat? And if it still seems  
worthwhile, where's the best place within Nutch to put it? (Perhaps a  
new ExtensionPoint that is used by "nutch updatedb"?)

--Matt

--
Matt Kangas / kangas@gmail.com



Re: [proposal] catching session-id urls

Posted by Matt Kangas <ka...@gmail.com>.
Thanks for the quick feedback, Stefan! :) Just to clarify the one  
example you provided:

> Also, your plugin needs to be a little bit smarter than just looking
> for different values, since foo.com/cms.do?page=1 and
> foo.com/cms.do?page=2 are two different pages.

The filter would flag this only if _every_ link on the page that referred
to foo.com contained "page=2". All it takes is a single link to a
different page for the filter to let this page pass.

The only case I can really see this happening is if there was no  
navigation back at all (to the homepage, etc). In this case, it's  
likely a recursive trap anyway.
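
To illustrate, using the hypothetical SessionIdSketch from the first
message (again just a sketch, not real Nutch code):

  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;

  public class PageParamExample {
    public static void main(String[] args) throws Exception {
      // fetched page: foo.com/cms.do?page=1
      Map<String, String> p0 =
          SessionIdSketch.parseParams("http://foo.com/cms.do?page=1");

      // the outlinks disagree on "page", so the page is NOT flagged
      List<Map<String, String>> outlinks = Arrays.asList(
          SessionIdSketch.parseParams("http://foo.com/cms.do?page=2"),
          SessionIdSketch.parseParams("http://foo.com/cms.do?page=3"));
      System.out.println(
          SessionIdSketch.looksLikeSessionIdChange(p0, outlinks)); // false

      // only if _every_ outlink carried page=2 (no navigation back at
      // all) would it be flagged -- likely a recursive trap anyway
      List<Map<String, String>> allPage2 = Arrays.asList(
          SessionIdSketch.parseParams("http://foo.com/cms.do?page=2"),
          SessionIdSketch.parseParams("http://foo.com/cms.do?page=2"));
      System.out.println(
          SessionIdSketch.looksLikeSessionIdChange(p0, allPage2)); // true
    }
  }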

--Matt


--
Matt Kangas / kangas@gmail.com



Re: [proposal] catching session-id urls

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Matt,

Sounds very interesting. Having an extension point during database update
would be great, since it would also allow implementing plugins that deal
with identical pages that carry different meta data, now that we have
meta data support.
So instead of designing the API so that we pass an old and a new
MapWritable, we could process a CrawlDatum from the CrawlDb and a new one
from the fetch process.

I see the following problems:
Currently the URL is used as the key, so you cannot match up an old and a
new CrawlDatum based on the URL when there are dynamic parameters.
Also, your plugin needs to be a little bit smarter than just looking for
different values, since foo.com/cms.do?page=1 and foo.com/cms.do?page=2
are two different pages.

What I can imagine is using normalized URLs to find identical pages:
store the parameters as meta data in the CrawlDatum, and have a plugin
that can process the CrawlDatum from the CrawlDb and the one from the
segment during the database-update reduce.
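
A very rough sketch of that plumbing, using plain Java stand-ins instead
of the real CrawlDatum/MapWritable classes (every name below is
hypothetical, and the comparison is deliberately naive):

  import java.util.Map;

  public class UpdateReduceSketch {

    // Stand-in for a CrawlDatum that carries its URL parameters as meta data.
    static class DatumWithParams {
      final Map<String, String> params;
      DatumWithParams(Map<String, String> params) { this.params = params; }
    }

    // Called once per normalized URL during the update "reduce": oldDatum
    // comes from the CrawlDb, newDatum from the freshly fetched segment.
    // Returns true if the two share the same parameter keys but differ in
    // some value -- the kind of transition a session-id plugin might look for.
    static boolean sessionIdTransition(DatumWithParams oldDatum,
        DatumWithParams newDatum) {
      Map<String, String> oldP = oldDatum.params;
      Map<String, String> newP = newDatum.params;
      if (!oldP.keySet().equals(newP.keySet())) return false; // structure changed
      for (Map.Entry<String, String> e : oldP.entrySet()) {
        if (!e.getValue().equals(newP.get(e.getKey()))) {
          return true; // same keys, at least one changed value
        }
      }
      return false;
    }
  }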

Just my 2 cents.
Greetings,
Stefan


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com