Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/09/28 13:48:06 UTC
regex-normalize - Re: SessionIDs and forums are killing my fetch

I thought this could be done via regex-normalize?  My preference is
to use the functionality/features of the configuration rather than
maintain a local patch.
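
As a sanity check on the rules themselves, here is a minimal standalone
sketch that applies the two sid patterns from the regex-normalize.xml
quoted below to the sample URLs. It assumes java.util.regex rather than
the Perl5-compatible engine the config comment mentions, and the class
name, the method name, and the viewtopic.php URL are made up for
illustration. It does not call any Nutch code.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: replays the two "sid" rules
// from the quoted regex-normalize.xml using java.util.regex.
final class SidNormalizerSketch {

    // The patterns as they look after the XML parser has expanded
    // &amp; to & (so "&amp;amp;" in the file becomes a literal "&amp;").
    private static final Pattern SID_AT_END =
        Pattern.compile("(\\?|&|&amp;)sid=[a-zA-Z0-9]{32}$");
    private static final Pattern SID_IN_MIDDLE =
        Pattern.compile("(\\?|&|&amp;)sid=[a-zA-Z0-9]{32}(&|&amp;)(.*)");

    // Apply the rules in file order: trailing sid first, then a sid
    // that is followed by further parameters.
    static String normalize(String url) {
        String out = SID_AT_END.matcher(url).replaceAll("");
        out = SID_IN_MIDDLE.matcher(out).replaceAll("$1$3");
        return out;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
            "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611",
            "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
                + "&sid=aa4d00c2717784c0c8f5a8182ce772ea",
            // Assumed URL shape with the sid in the middle of the query string.
            "http://domain/forum/viewtopic.php?sid=0c5dd3b77aac9d081b2108cfb3dae592&t=42"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + normalize(url));
        }
    }
}
```

If the sketch prints the URLs with the sid removed, the patterns
themselves are probably fine, and the next thing to check would be
whether the regex normalizer is actually being applied during the crawl.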

-j

Jack Tang wrote:
> Hi Jon
> 
> Please see the details in the getOutlinks() method of the DOMContentUtils
> class in the parse-html plugin.
> 
> You can revise the URLs before
> 
> outlinks.add(new Outlink(url.toString(), linkText
>                                     .toString().trim()));
> 
> Hope it helps
> 
> Regards
> /Jack
> 
> On 9/28/05, Gal Nitzan <gn...@usa.net> wrote:
> 
>>Hi Jack,
>>
>>How can you discard a URL from the fetchlist?
>>
>>Regards,
>>Gal
>>
>>Jack Tang wrote:
>>
>>>Hi Jon
>>>
>>>I think you can revise the URL by discarding the "sid" param before
>>>putting it into the fetchlist.
>>>
>>>Regards
>>>/Jack
>>>
>>>On 9/28/05, Jon Shoberg <jo...@shoberg.net> wrote:
>>>
>>>
>>>>Gal Nitzan wrote:
>>>>
>>>>
>>>>>Jon Shoberg wrote:
>>>>>
>>>>>
>>>>>
>>>>>>I'm getting a ton of duplicate content from a forum with session IDs.
>>>>>>It's a phpBB forum, which appends a sid parameter to the URL query string.
>>>>>>
>>>>>>What have other people done to crawl forums and minimize duplicates?
>>>>>>These are ones that dedup is not catching.
>>>>>>
>>>>>>Is anyone able to explain how regex-normalize.xml is used? I'm about to
>>>>>>open the source and see...
>>>>>>
>>>>>>The URLs look like the following and appear to have the same content to the user:
>>>>>>
>>>>>>http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
>>>>>>http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
>>>>>>http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
>>>>>>
>>>>>>
>>>>>>Below is my regex normalize file:
>>>>>>
>>>>>><?xml version="1.0"?>
>>>>>><!-- This is the configuration file for the RegexUrlNormalize Class.
>>>>>>     This is intended so that users can specify substitutions to be
>>>>>>     done on URLs. The regex engine that is used is Perl5 compatible.
>>>>>>     The rules are applied to URLs in the order they occur in this
>>>>>>     file.  -->
>>>>>>
>>>>>><!-- WATCH OUT: an xml parser reads this file, so ampersands must be
>>>>>>     expanded to &amp; -->
>>>>>>
>>>>>><!-- The following rules show how to strip out session IDs
>>>>>>     that are 32 characters long and have the parameter
>>>>>>     name of PHPSESSID. Order does matter!  -->
>>>>>><regex-normalize>
>>>>>><regex>
>>>>>>  <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
>>>>>>  <substitution></substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>>  <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
>>>>>>  <substitution>$1$3</substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>>  <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
>>>>>>  <substitution></substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>>  <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
>>>>>>  <substitution>$1$3</substitution>
>>>>>></regex>
>>>>>></regex-normalize>
>>>>>>
>>>>>>.
>>>>>>
>>>>>>
>>>>>
>>>>>Hi Jon,
>>>>>
>>>>>I'm not sure the normalize file is the correct place. I use
>>>>>regex-urlfilter.txt with the following:
>>>>>
>>>>>-(session|Session|SESS|sid)
>>>>>
>>>>>I know it might leave a url like obsession.url out, but it is better
>>>>>than your fetcher running in circles :-)
>>>>>
>>>>>Hope it helps,
>>>>>
>>>>>Gal
>>>>>
>>>>
>>>>Yes,
>>>>
>>>>   Better than circles, but I'm looking to refine the config to allow
>>>>for this, not just avoid them.
>>>>
>>>>-j
>>>>
>>>>
>>>
>>>
>>>--
>>>Keep Discovering ... ...
>>>http://www.jroller.com/page/jmars
>>>
>>>.
>>>
>>>
>>
>>
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars