Posted to user@nutch.apache.org by Philip Brown <ph...@primeradesigns.com> on 2006/09/01 12:32:37 UTC
regex-normalizer.xml substitution value?
I am new to regex. What will "$1$3" produce in the following
element? What values do $1 and $3 hold?
<regex>
<pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
If I leave the substitution empty, as <substitution></substitution>, will this
just get rid of ;jsessionid=123456789...?
Thanks
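For what it's worth, here is a minimal sketch of what an empty substitution would likely do, assuming the normalizer applies each pattern the way java.util.regex's replaceAll does (that is an assumption about the Nutch internals, not something confirmed in this thread). The class name, helper method, and example URL are made up for illustration. Because the pattern has (.*) on both sides, it matches the whole URL, so an empty substitution would wipe out the whole URL rather than just the session id.

```java
import java.util.regex.Pattern;

public class EmptySubstitution {
    // Pattern from the question: matches the entire URL when a session id is present.
    static final Pattern WHOLE_URL =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Hypothetical alternative: matches only the session id itself.
    static final Pattern ID_ONLY =
        Pattern.compile(";jsessionid=[a-zA-Z0-9]{32}");

    // Replace every match of the pattern in the URL with the given substitution.
    static String replaceWith(Pattern pattern, String url, String substitution) {
        return pattern.matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        // Illustrative URL with a 32-character session id (not from the thread).
        String url = "http://example.com/page.jsp"
            + ";jsessionid=0123456789ABCDEF0123456789ABCDEF?id=7";

        // Empty substitution with the original pattern: the whole match is dropped.
        System.out.println("[" + replaceWith(WHOLE_URL, url, "") + "]");

        // Empty substitution with the narrower pattern: only the session id is dropped.
        System.out.println(replaceWith(ID_ONLY, url, ""));
    }
}
```

So with the pattern as written, keeping $1$3 looks like the safer choice; an empty substitution would only make sense if the pattern were narrowed to the session id alone.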
Re: regex-normalizer.xml substitution value?
Posted by Philip Brown <ph...@primeradesigns.com>.
bug-ger!
Hi,
To get regex-normalize.xml to work, I must put:
<property>
<name>urlnormalizer.class</name>
<value>org.apache.nutch.net.RegexUrlNormalizer</value>
<description>Name of the class used to normalize URLs.</description>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
<description>Name of the config file used by the RegexUrlNormalizer
class.</description>
</property>
in nutch-site.xml.
By default, nutch-default.xml sets:
<property>
<name>urlnormalizer.class</name>
<value>org.apache.nutch.net.BasicUrlNormalizer</value>
<description>Name of the class used to normalize URLs.</description>
</property>
Is this a bug or a feature? =)
Re: regex-normalizer.xml substitution value?
Posted by Philip Brown <ph...@primeradesigns.com>.
Philip Brown wrote:
> I am new to regex. What will the "$1$3" reproduce in the following
> element. What values are $1$3?
>
> <regex>
> <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
> <substitution>$1$3</substitution>
> </regex>
>
> if I leave substitution as <substitution></substitution> will this
> just get rid of ;jsessionid=123456789...
>
> thanks
Am I wrong to assume that $1 is the first (.*) and $3 is the second (.*)?
That would make $2 the (;jsessionid=[a-zA-Z0-9]{32}) group in
(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)
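To check that reading of the groups, here is a small sketch using plain java.util.regex. It assumes the normalizer numbers groups and expands $1 and $3 the same way Java's Matcher does, which I have not verified against the Nutch source. The class name, method names, and sample URL are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SessionIdGroups {
    // Same pattern as in regex-normalizer.xml above.
    static final Pattern SESSION =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Return groups 1..3 for a URL, or an empty array if the pattern does not match.
    static String[] groups(String url) {
        Matcher m = SESSION.matcher(url);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Apply the "$1$3" substitution: keep what is before and after the session id.
    static String normalize(String url) {
        return SESSION.matcher(url).replaceAll("$1$3");
    }

    public static void main(String[] args) {
        // Illustrative URL with a 32-character session id (not from the thread).
        String url = "http://example.com/page.jsp"
            + ";jsessionid=0123456789ABCDEF0123456789ABCDEF?id=7";

        String[] g = groups(url);
        for (int i = 0; i < g.length; i++) {
            System.out.println("$" + (i + 1) + " = " + g[i]);
        }
        System.out.println("result = " + normalize(url));
    }
}
```

If that assumption holds, $1 is everything before the session id, $2 is the session id itself, and $3 is everything after it, so "$1$3" rebuilds the URL with the session id cut out.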
I have run a crawl with this value:
<regex>
<pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
<substitution>$1$3</substitution>
</regex>
added to my regex-normalizer.xml.
However, it did not change the db. Does bin/nutch crawl need a special
command to get this to run?