Posted to user@nutch.apache.org by Smith Norton <sm...@gmail.com> on 2007/09/13 15:40:33 UTC
Sample normalize
In the regex-normalize.xml the following code is present.
<regex>
<pattern>(\?|\&|\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&|\&amp;)(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
</regex-normalize>
Could anyone please explain to me, with an example, what type of URL it
is normalizing, and to what?
Re: Sample normalize
Posted by Carl Cerecke <ca...@nzs.com>.
Smith Norton wrote:
> In the regex-normalize.xml the following code is present.
>
> <regex>
> <pattern>(\?|\&|\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&|\&amp;)(.*)</pattern>
> <substitution>$1$3</substitution>
> </regex>
> </regex-normalize>
>
> Could anyone please explain me with an example, what type of URL it is
> normalizing to what?
I'm pretty sure the $1 means the first matched group in the regex, and
$3 the third. So in this case:
http://domain.name/path/bar.php&PHPSESSID=lajhdgfjdfhgkjasdgdfghsdfl&some-other-stuff-here
would normalise to:
http://domain.name/path/bar.php&some-other-stuff-here
In other words, it is stripping out the session ID.
Cheers,
Carl.
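The reading of $1 and $3 above can be checked with a small script. This is only a sketch in Python, not how Nutch itself applies the rule: it assumes the pattern behaves the same under Python's re module, writes the substitution $1$3 in Python's syntax as \1\3, and uses a made-up 32-character session ID, since the pattern asks for exactly 32 letters or digits.

```python
import re

# The pattern from the question, copied as it is displayed in the post.
PATTERN = re.compile(r"(\?|\&|\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&|\&amp;)(.*)")

# Hypothetical 32-character session ID, invented for this example.
SESSION_ID = "0123456789abcdef0123456789abcdef"
url = ("http://domain.name/path/bar.php&PHPSESSID=" + SESSION_ID
       + "&some-other-stuff-here")

match = PATTERN.search(url)
group1 = match.group(1)  # delimiter in front of PHPSESSID
group2 = match.group(2)  # delimiter after the session ID (dropped)
group3 = match.group(3)  # everything after that delimiter

# "$1$3" in the XML rule corresponds to r"\1\3" in Python.
normalized = PATTERN.sub(r"\1\3", url)

print("group 1:", group1)
print("group 3:", group3)
print("result: ", normalized)
```

Group 2 is matched but left out of the substitution, which is why only one delimiter remains between the path and the rest of the query.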
Re: Sample normalize
Posted by Marcin Okraszewski <ok...@o2.pl>.
It will simply remove PHPSESSID from the URL.
For instance:
http://example.org/page.html?param1=1&PHPSESSID=ABCFEF&param2=2
will be changed to
http://example.org/page.html?param1=1&param2=2
Marcin
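It may also help to see which URLs this particular rule leaves alone. The sketch below is an illustration in Python under the same assumption that the pattern behaves as it would in the normalizer. The normalize helper and the 32-character ID are hypothetical. Because the pattern demands exactly 32 letters or digits followed by another delimiter, a short ID such as ABCFEF, or a session ID that is the last parameter, is not touched by this rule.

```python
import re

# The pattern from the question, copied as it is displayed in the post.
PATTERN = re.compile(r"(\?|\&|\&amp;)PHPSESSID=[a-zA-Z0-9]{32}(\&|\&amp;)(.*)")


def normalize(url: str) -> str:
    """Hypothetical helper: apply the rule once, with $1$3 written as \\1\\3."""
    return PATTERN.sub(r"\1\3", url)


FULL_ID = "a" * 32  # made-up ID of the length the pattern expects
BASE = "http://example.org/page.html"

# Session ID in the middle of the query string: removed.
middle = normalize(BASE + "?param1=1&PHPSESSID=" + FULL_ID + "&param2=2")

# Session ID as the first parameter: removed, the "?" is kept via group 1.
first = normalize(BASE + "?PHPSESSID=" + FULL_ID + "&param2=2")

# ID shorter than 32 characters: no match, URL unchanged.
short_id = normalize(BASE + "?param1=1&PHPSESSID=ABCFEF&param2=2")

# Session ID as the last parameter (no trailing delimiter): unchanged.
last = normalize(BASE + "?param1=1&PHPSESSID=" + FULL_ID)

for result in (middle, first, short_id, last):
    print(result)
```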