Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/09/28 13:48:06 UTC
regex-normalize - Re: SessionIDs and forums are killing my fetch
I thought this could be done via regex-normalize? It is my preference
to use functionality/features of the configuration rather than
maintaining a local patch.
-j
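
A rough sketch of what the two "sid" rules in the quoted regex-normalize.xml do to the sample URLs from the original post. This is not Nutch code: it uses Python's re module as a stand-in for the Perl5-compatible engine, the patterns are written as the regex engine would see them once the XML entities are decoded, and the helper name normalize is made up for illustration.

```python
import re

# The two "sid" rules from the quoted file, in file order.
# Assumption: Python's re syntax behaves like the Perl5-compatible engine
# for these particular patterns; r"\1\3" is Python's spelling of "$1$3".
SID_RULES = [
    # sid is the last parameter: drop it together with its leading ? or &
    (re.compile(r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}$"), ""),
    # sid is followed by more parameters: keep the leading separator and the rest
    (re.compile(r"(\?|&|&amp;)sid=[a-zA-Z0-9]{32}(&|&amp;)(.*)"), r"\1\3"),
]


def normalize(url):
    """Apply each rule in order, replacing every match (hypothetical helper)."""
    for pattern, substitution in SID_RULES:
        url = pattern.sub(substitution, url)
    return url


if __name__ == "__main__":
    for sample in (
        "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
        "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611",
        "http://domain/forum/login.php?redirect=profile.php&mode=editprofile"
        "&sid=aa4d00c2717784c0c8f5a8182ce772ea",
    ):
        print(normalize(sample))
```

Under these assumptions both faq.php URLs collapse to the same string, which is what lets later duplicate detection treat them as one page.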
Jack Tang wrote:
> Hi Jon
>
> Please see the details of the getOutlinks() method in the DOMContentUtils
> class of the parse-html plugin.
>
> You can revise the URLs before this call:
>
> outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
>
> Hope it helps
>
> Regards
> /Jack
>
> On 9/28/05, Gal Nitzan <gn...@usa.net> wrote:
>
>>Hi Jack,
>>
>>How can you discard a URL from the fetchlist?
>>
>>Regards,
>>Gal
>>
>>Jack Tang wrote:
>>
>>>Hi Jon
>>>
>>>I think you can revise the URL by discarding the "sid" param before
>>>putting it into the fetchlist.
>>>
>>>Regards
>>>/Jack
>>>
>>>On 9/28/05, Jon Shoberg <jo...@shoberg.net> wrote:
>>>
>>>
>>>>Gal Nitzan wrote:
>>>>
>>>>
>>>>>Jon Shoberg wrote:
>>>>>
>>>>>
>>>>>
>>>>>>I'm getting a ton of duplicate content from a forum with sessionIDs.
>>>>>>It's a phpBB forum that uses a question mark in the URL and a sid parameter.
>>>>>>
>>>>>>What have other people done to crawl forums and minimize duplicates?
>>>>>>These are ones that dedup is not catching.
>>>>>>
>>>>>>Is anyone able to explain how regex-normalize.xml is used? I'm about to
>>>>>>open the source and see...
>>>>>>
>>>>>>The URLs look like this and appear to have the same content to the user:
>>>>>>
>>>>>>http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
>>>>>>http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
>>>>>>http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea
>>>>>>
>>>>>>
>>>>>>Below is my regex normalize file:
>>>>>>
>>>>>><?xml version="1.0"?>
>>>>>><!-- This is the configuration file for the RegexUrlNormalize Class.
>>>>>> This is intended so that users can specify substitutions to be
>>>>>> done on URLs. The regex engine that is used is Perl5 compatible.
>>>>>> The rules are applied to URLs in the order they occur in this
>>>>>>file. -->
>>>>>>
>>>>>><!-- WATCH OUT: an XML parser reads this file, so ampersands must be
>>>>>> expanded to &amp; -->
>>>>>>
>>>>>><!-- The following rules show how to strip out session IDs
>>>>>> that are 32 characters long and have the parameter
>>>>>> name of PHPSESSID. Order does matter! -->
>>>>>><regex-normalize>
>>>>>><regex>
>>>>>> <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
>>>>>> <substitution></substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>> <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
>>>>>> <substitution>$1$3</substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>> <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
>>>>>> <substitution></substitution>
>>>>>></regex>
>>>>>><regex>
>>>>>> <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
>>>>>> <substitution>$1$3</substitution>
>>>>>></regex>
>>>>>></regex-normalize>
>>>>>>
>>>>>>.
>>>>>>
>>>>>>
>>>>>
>>>>>Hi Jon,
>>>>>
>>>>>I'm not sure the normalize file is the correct place. I use
>>>>>regex-urlfilter.txt with the following:
>>>>>
>>>>>-(session|Session|SESS|sid)
>>>>>
>>>>>I know it might leave a url like obsession.url out, but it is better
>>>>>than your fetcher running in circles :-)
>>>>>
>>>>>Hope it helps,
>>>>>
>>>>>Gal
>>>>>
>>>>
>>>>Yes,
>>>>
>>>> Better than circles, but I'm looking to refine the config to allow
>>>>for this, not just avoid them.
>>>>
>>>>-j
>>>>
>>>>
>>>
>>>
>>>--
>>>Keep Discovering ... ...
>>>http://www.jroller.com/page/jmars
>>>
>>>.
>>>
>>>
>>
>>
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
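
For comparison, the exclusion rule quoted in the thread, -(session|Session|SESS|sid), can be sketched as below. This assumes that a "-" rule rejects any URL in which the pattern is found anywhere, which is why it also catches unrelated URLs such as the "obsession" example. The function name accepted and the test URLs other than the forum one are hypothetical.

```python
import re

# Assumption: a "-" rule in the URL filter rejects a URL as soon as the
# pattern is found anywhere inside it (search semantics, not a full match).
EXCLUDE = re.compile(r"(session|Session|SESS|sid)")


def accepted(url):
    """Return True if the URL survives the exclusion rule (hypothetical helper)."""
    return EXCLUDE.search(url) is None


if __name__ == "__main__":
    for sample in (
        "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
        "http://domain/obsession.url",   # contains "session"
        "http://domain/inside.html",     # contains "sid"
        "http://domain/forum/index.php",
    ):
        print(sample, "kept" if accepted(sample) else "dropped")
```

The trade-off discussed in the thread shows up directly: filtering drops every session URL and some innocent ones, while normalizing keeps the page and removes only the session parameter.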