Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/09/28 05:11:49 UTC

SessionIDs and forums are killing my fetch

I'm getting a ton of duplicate content from a forum with session IDs. 
It's a phpBB forum, which puts a sid parameter after the question mark 
in the URL.

What have other people done to crawl forums and minimize duplicates? 
These are duplicates that dedup is not catching.

Can anyone explain how regex-normalize.xml is used? I'm about to open 
the source and see...

URLs like these appear to have the same content to the user:

http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592
http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611
http://domain/forum/login.php?redirect=profile.php&mode=editprofile&sid=aa4d00c2717784c0c8f5a8182ce772ea

Below is my regex normalize file:

<?xml version="1.0"?>
<!-- This is the configuration file for the RegexUrlNormalize Class.
      This is intended so that users can specify substitutions to be
      done on URLs. The regex engine that is used is Perl5 compatible.
      The rules are applied to URLs in the order they occur in this
      file.  -->

<!-- WATCH OUT: an xml parser reads this file, so ampersands must be
      expanded to &amp; -->

<!-- The following rules show how to strip out session IDs
      that are 32 characters long and have the parameter
      name of PHPSESSID. Order does matter!  -->
<regex-normalize>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}$</pattern>
   <substitution></substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
   <substitution>$1$3</substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}$</pattern>
   <substitution></substitution>
</regex>
<regex>
   <pattern>(\?|\&amp;|\&amp;amp;)sid=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
   <substitution>$1$3</substitution>
</regex>
</regex-normalize>
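
For reference, a minimal sketch of what the four rules above would do if they are applied in file order, each rule working on the output of the previous one. It uses java.util.regex rather than the Perl5-compatible engine the comment mentions, and the class and method names are made up for illustration; this is not the RegexUrlNormalize source. The patterns are written as they look after the XML parser has unescaped the entities (\&amp; becomes \&, \&amp;amp; becomes \&amp;).

```java
import java.util.regex.Pattern;

public class SidNormalizeSketch {
    // {pattern, substitution} pairs in the same order as the config file.
    static final String[][] RULES = {
        {"(\\?|\\&|\\&amp;)PHPSESSID=[a-zA-Z0-9]{32}$", ""},
        {"(\\?|\\&|\\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)", "$1$3"},
        {"(\\?|\\&|\\&amp;)sid=[a-zA-Z0-9]{32}$", ""},
        {"(\\?|\\&|\\&amp;)sid=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)", "$1$3"},
    };

    // Apply every rule in order, feeding each result into the next rule.
    static String normalize(String url) {
        for (String[] rule : RULES) {
            url = Pattern.compile(rule[0]).matcher(url).replaceAll(rule[1]);
        }
        return url;
    }

    public static void main(String[] args) {
        // The two faq.php URLs from the post should collapse to one URL.
        System.out.println(normalize(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592"));
        System.out.println(normalize(
            "http://domain/forum/faq.php?sid=0c6f7ba099d901c634379994cdacc611"));
    }
}
```

If the rules behave like this in isolation but the crawl still shows duplicates, the rules themselves are probably not the problem, and it is worth checking whether the regex normalizer is actually the one being used.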

Re: pattern matching and boolean searches

Posted by Robert Benea <ro...@gmail.com>.

I think you can build your own plug-in and do whatever type of search you
want (Lucene style). I added a query plugin myself to handle my needs ;-).

Cheers,
R.

pattern matching and boolean searches

Posted by Edward Quick <ed...@hotmail.com>.

Hi,

I posted this question the other day but didn't get a reply, which may have 
been because it was an annoying FAQ, or the subject wasn't catchy enough! 
Anyway, one more try, so here goes! Please help if you can.

Should I be able to do Lucene-type searches with Nutch? I know Nutch can now 
do type: and url: queries, but how about pattern matching queries? For 
example:

te*t
tes?t

or Boolean searches? I haven't got it to work so far, but wondered whether 
there was something I needed to enable. Incidentally, yes, I did enable the 
index-more and query-more plugins.

Thanks for any help.

Ed.
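
For readers unfamiliar with the two example terms: in Lucene's query syntax, * stands for any run of characters (including none) and ? stands for exactly one character. The sketch below only illustrates those semantics by translating a wildcard term into a java.util.regex pattern; the class and method names are hypothetical, and it says nothing about whether the Nutch query parser accepts such terms.

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translate a wildcard term into an equivalent java.util.regex pattern.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");   // any run of characters, including none
            } else if (c == '?') {
                sb.append('.');    // exactly one character
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True when the whole term matches the wildcard expression.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te*t", "test"));   // * can stand for "s"
        System.out.println(matches("tes?t", "test"));  // ? needs one extra character
    }
}
```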

Re: search with ndfs/mapred index

Posted by Gal Nitzan <gn...@usa.net>.

Please ignore; I found the information. Thanks.

Re: search with ndfs/mapred index

Posted by Gal Nitzan <gn...@usa.net>.

OK, I figured out the part with the bin/nutch server and now the server 
is running.

I have created a file /mapred/search-servers.txt,
which contains the line:

localhost:8070

which I'm not sure is what should be there.

In WEB-INF/classes/nutch-default.xml I set the value of 
searcher.dir to point to /mapred, where I have the aforementioned file.

Thanks,

Gal

Re: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance - IGNORE sorry

Posted by Gal Nitzan <gn...@usa.net>.

java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance

Posted by Gal Nitzan <gn...@usa.net>.

Hi,

While connecting to the search server I get the following exception. Does 
anybody have a clue?

050927 205228 10 opening indexes in /user/root/crawl-20050927142856/indexes/indexes
050927 205228 10 opening segments in /user/root/crawl-20050927142856/indexes/segments
050927 205228 10 opening linkdb in /user/root/crawl-20050927142856/indexes/linkdb
050927 205228 12 Server listener on port 8070: starting
050927 205228 13 Server handler on 8070: starting
050927 205228 14 Server handler on 8070: starting
050927 205228 15 Server handler on 8070: starting
050927 205228 16 Server handler on 8070: starting
050927 205228 17 Server handler on 8070: starting
050927 205228 18 Server handler on 8070: starting
050927 205228 19 Server handler on 8070: starting
050927 205228 21 Server handler on 8070: starting
050927 205228 22 Server handler on 8070: starting
050927 205228 20 Server handler on 8070: starting
050928 021500 23 Server connection on port 8070 from 127.0.0.1: starting
050928 021500 21 Call: getSegmentNames()
050928 021500 21 Return: [Ljava.lang.String;@10da5eb
050928 021500 23 Server connection on port 8070 from 127.0.0.1 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.ipc.RPC$NullInstance
        at org.apache.nutch.io.ObjectWritable.readObject(ObjectWritable.java:183)
        at org.apache.nutch.ipc.RPC$Invocation.readFields(RPC.java:88)
        at org.apache.nutch.ipc.Server$Connection.run(Server.java:136)
050928 021500 23 Server connection on port 8070 from 127.0.0.1: exiting
050928 021510 24 Server connection on port 8070 from 127.0.0.1: starting
050928 021510 13 Call: getSegmentNames()
050928 021510 13 Return: [Ljava.lang.String;@10da5eb

Gal

search with ndfs/mapred index

Posted by Gal Nitzan <gn...@usa.net>.

Hi,

I have successfully run mapred.

How do I set the servlet to search the index which is under ndfs?

Thanks,

Gal

regex-normalize - Re: SessionIDs and forums are killing my fetch

Posted by Jon Shoberg <jo...@shoberg.net>.

I thought this could be done via regex-normalize?  It is my preference 
to use functionality/features of the configuration rather than 
maintaining a local patch.

-j

Jack Tang wrote:
> Hi Jon
> 
> Please see the details in the getOutlinks() method of the
> DOMContentUtils class in the parse-html plugin.
> 
> You can revise the URLs before
> 
> outlinks.add(new Outlink(url.toString(), linkText
>                                     .toString().trim()));
> 
> Hope it helps
> 
> Regards
> /Jack
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

Re: SessionIDs and forums are killing my fetch

Posted by Jack Tang <hi...@gmail.com>.

Hi Jon

Please see the details in the getOutlinks() method of the
DOMContentUtils class in the parse-html plugin.

You can revise the URLs before

outlinks.add(new Outlink(url.toString(), linkText
                                    .toString().trim()));

Hope it helps

Regards
/Jack
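
A minimal sketch of the kind of URL revision suggested above, assuming the goal is simply to drop one named query parameter before the Outlink is created. The helper stripParam is hypothetical (it is not part of the parse-html plugin), it works on plain strings, and it does not handle URL fragments.

```java
public class OutlinkUrlSketch {
    // Remove every "name=value" pair for the given parameter from the query string.
    static String stripParam(String url, String name) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to strip
        }
        String base = url.substring(0, q);
        StringBuilder kept = new StringBuilder();
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty() || pair.equals(name) || pair.startsWith(name + "=")) {
                continue; // drop the session parameter and stray empty pairs
            }
            if (kept.length() > 0) {
                kept.append('&');
            }
            kept.append(pair);
        }
        // Omit the "?" entirely when no parameters are left.
        return kept.length() == 0 ? base : base + "?" + kept;
    }

    public static void main(String[] args) {
        System.out.println(stripParam(
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592", "sid"));
    }
}
```

In getOutlinks() a call like this would sit just before the outlinks.add(...) line, applied to url.toString(); the exact placement is an assumption, and it means carrying a local patch to the plugin.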

On 9/28/05, Gal Nitzan <gn...@usa.net> wrote:
> Hi Jack,
>
> How can you discard a URL from the fetchlist?
>
> Regards,
> Gal
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: SessionIDs and forums are killing my fetch

Posted by Gal Nitzan <gn...@usa.net>.

Hi Jack,

How can you discard a URL from the fetchlist?

Regards,
Gal

Jack Tang wrote:
> Hi Jon
>
> I think you can revise the URL by discarding the "sid" param before
> putting it into the fetchlist.
>
> Regards
> /Jack
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

Re: SessionIDs and forums are killing my fetch

Posted by Jack Tang <hi...@gmail.com>.

Hi Jon

I think you can revise the URL by discarding the "sid" param before
putting it into the fetchlist.

Regards
/Jack

On 9/28/05, Jon Shoberg <jo...@shoberg.net> wrote:
> Yes,
>
>    Better than circles, but I'm looking to refine the config to handle
> these URLs, not just avoid them.
>
> -j
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: SessionIDs and forums are killing my fetch

Posted by Jon Shoberg <jo...@shoberg.net>.

Gal Nitzan wrote:
> Hi Jon,
> 
> I'm not sure if the normalize file is the correct place; I use 
> regex-urlfilter.txt with the following:
> 
> -(session|Session|SESS|sid)
> 
> I know it might leave a url like obsession.url out, but it is better 
> than your fetcher running in circles :-)
> 
> Hope it helps,
> 
> Gal

Yes,

   Better than circles, but I'm looking to refine the config to handle 
these URLs, not just avoid them.

-j

Re: SessionIDs and forums are killing my fetch

Posted by Gal Nitzan <gn...@usa.net>.

Hi Jon,

I'm not sure if the normalize file is the correct place; I use 
regex-urlfilter.txt with the following:

-(session|Session|SESS|sid)

I know it might leave a url like obsession.url out, but it is better 
than your fetcher running in circles :-)

Hope it helps,

Gal
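
To make the trade-off above concrete, here is a small sketch of how an unanchored exclusion pattern behaves, assuming a "-" rule rejects a URL whenever its pattern is found anywhere in the URL. The NARROW pattern is a hypothetical tighter alternative, not something taken from a Nutch config, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class SessionFilterSketch {
    // The exclusion pattern from the message, matched anywhere in the URL.
    static final Pattern BROAD = Pattern.compile("(session|Session|SESS|sid)");
    // Hypothetical tighter pattern: only a query parameter named sid or PHPSESSID.
    static final Pattern NARROW = Pattern.compile("[?&](sid|PHPSESSID)=");

    // A "-" rule rejects the URL when its pattern occurs anywhere in it.
    static boolean rejected(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://domain/obsession.html",
            "http://domain/inside.html",
            "http://domain/forum/faq.php?sid=0c5dd3b77aac9d081b2108cfb3dae592",
        };
        for (String url : urls) {
            System.out.println(url + " broad=" + rejected(BROAD, url)
                + " narrow=" + rejected(NARROW, url));
        }
    }
}
```

Note that the broad pattern also catches any URL that merely contains the letters "sid", such as inside.html, so it can drop more pages than the obsession example suggests.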