Posted to user@nutch.apache.org by Smith Norton <sm...@gmail.com> on 2007/09/13 15:40:33 UTC

Sample normalize

In the regex-normalize.xml the following code is present.

<regex>
  <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>
</regex-normalize>

Could anyone please explain, with an example, what kind of URL this rule
normalizes, and to what?

Re: Sample normalize

Posted by Carl Cerecke <ca...@nzs.com>.
Smith Norton wrote:
> In the regex-normalize.xml the following code is present.
> 
> <regex>
>   <pattern>(\?|\&amp;|\&amp;amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&amp;|\&amp;amp;)(.*)</pattern>
>   <substitution>$1$3</substitution>
> </regex>
> </regex-normalize>
> 
> Could anyone please explain, with an example, what kind of URL this rule
> normalizes, and to what?

I'm pretty sure $1 means the first matched group in the regex, and
$3 the third. So in this case (the ID has to be exactly 32 alphanumeric
characters for the pattern to match):

http://domain.name/path/bar.php&PHPSESSID=lajhdgfjdfhgkjasdgdfghsdflabc123&some-other-stuff-here

would normalise to:

http://domain.name/path/bar.php&some-other-stuff-here

In other words, it is stripping out the session ID.
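
As a rough sketch of how the rule behaves, assuming the normalizer applies it with java.util.regex replaceAll semantics (the class name, method name and sample URLs below are made up for illustration and are not taken from Nutch):

```java
import java.util.regex.Pattern;

class SessionIdRuleSketch {
    // The <pattern> value after XML entity decoding:
    // &amp; in the file becomes &, and &amp;amp; becomes the literal text &amp;
    static final Pattern RULE = Pattern.compile(
        "(\\?|\\&|\\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\\&|\\&amp;)(.*)");

    // The <substitution> value: keep group 1 (the leading ? or &)
    // and group 3 (everything after the session ID and its trailing &).
    static final String SUBSTITUTION = "$1$3";

    // Hypothetical helper: apply the single rule to one URL.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        String id = "0123456789abcdef0123456789abcdef"; // made-up 32-character ID
        String base = "http://example.org/page.html";

        // Session ID in the middle of the query string.
        System.out.println(normalize(base + "?param1=1&PHPSESSID=" + id + "&param2=2"));
        // Session ID as the first parameter.
        System.out.println(normalize(base + "?PHPSESSID=" + id + "&param2=2"));
        // ID shorter than 32 characters: the pattern does not match.
        System.out.println(normalize(base + "?param1=1&PHPSESSID=ABCDEF&param2=2"));
        // Session ID as the last parameter: no trailing & for group 2 to match.
        System.out.println(normalize(base + "?param1=1&PHPSESSID=" + id));
    }
}
```

With these inputs the first two URLs lose the session ID, while the last two come back unchanged, because the pattern requires exactly 32 alphanumeric characters followed by another parameter.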

Cheers,
Carl.

Re: Sample normalize

Posted by Marcin Okraszewski <ok...@o2.pl>.
It simply removes the PHPSESSID parameter from the URL.

For instance:
http://example.org/page.html?param1=1&PHPSESSID=0123456789abcdef0123456789abcdef&param2=2

will be changed to
http://example.org/page.html?param1=1&param2=2

Marcin

