Posted to user@nutch.apache.org by Philip Brown <ph...@primeradesigns.com> on 2006/09/01 12:32:37 UTC

regex-normalizer.xml substitution value?

I am new to regex. What will "$1$3" produce in the following 
element? What values do $1 and $3 hold?

<regex>
  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>

If I leave the substitution empty, as <substitution></substitution>, will this 
just get rid of ;jsessionid=123456789...?

thanks
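
To make the question concrete, here is a minimal Java sketch of both substitutions. It assumes the normalizer applies a rule roughly the way java.util.regex's replaceAll does; the class name, method name, and sample URL are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class SubstitutionDemo {
    // Same pattern as in the <pattern> element above.
    static final Pattern SESSION =
        Pattern.compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)");

    // Hypothetical helper: apply one pattern/substitution pair to a URL.
    static String normalize(String url, String substitution) {
        return SESSION.matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        // Illustrative URL with a 32-character session id.
        String url = "http://example.com/page.jsp"
            + ";jsessionid=0123456789ABCDEF0123456789ABCDEF?id=7";

        // Keeps the text before and after the session id.
        System.out.println(normalize(url, "$1$3"));

        // The pattern matches the whole URL, so an empty substitution
        // replaces the whole URL, not just the session id.
        System.out.println(normalize(url, ""));
    }
}
```

Under that assumption, an empty substitution would blank out the entire URL, because the two (.*) groups make the pattern match everything around the session id as well. "$1$3" puts those surrounding parts back.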

Re: regex-normalizer.xml substitution value?

Posted by Philip Brown <ph...@primeradesigns.com>.
bug-ger!

Hi,

To get regex-normalize.xml to work I must put:

<property>
   <name>urlnormalizer.class</name>
   <value>org.apache.nutch.net.RegexUrlNormalizer</value>
   <description>Name of the class used to normalize URLs.</description>
</property>

<property>
   <name>urlnormalizer.regex.file</name>
   <value>regex-normalize.xml</value>
   <description>Name of the config file used by the RegexUrlNormalizer 
class.</description>
</property>

in nutch-site.xml.

In nutch-default.xml, the following is set:

<property>
   <name>urlnormalizer.class</name>
   <value>org.apache.nutch.net.BasicUrlNormalizer</value>
   <description>Name of the class used to normalize URLs.</description>
</property>

Is this a bug or a feature? =)



Re: regex-normalizer.xml substitution value?

Posted by Philip Brown <ph...@primeradesigns.com>.
Philip Brown wrote:
> I am new to regex. What will the "$1$3" reproduce in the following 
> element. What values are $1$3?
>
> <regex>
>  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
>  <substitution>$1$3</substitution>
> </regex>
>
> if I leave substitution as   <substitution></substitution> will this 
> just get rid of ;jsessionid=123456789...
>
> thanks
Am I right to assume that $1 and $3 are the two (.*) groups, which would 
make $2 the middle group, (;jsessionid=[a-zA-Z0-9]{32}), in 
(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)?
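
One way to check this numbering is to print the groups directly. The sketch below uses plain java.util.regex rather than any Nutch class; the class name, helper, and sample URL are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: returns groups 1..3, or an empty array
    // if the URL carries no 32-character jsessionid.
    static String[] groups(String url) {
        Matcher m = Pattern
            .compile("(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)")
            .matcher(url);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Illustrative URL, not taken from a real crawl.
        String url = "http://example.com/page.jsp"
            + ";jsessionid=0123456789ABCDEF0123456789ABCDEF?id=7";
        String[] g = groups(url);
        for (int i = 0; i < g.length; i++) {
            System.out.println("$" + (i + 1) + " = " + g[i]);
        }
    }
}
```

Groups are numbered by the position of their opening parenthesis, left to right, so for this URL $1 is the part before the session id, $2 is the session id itself, and $3 is the trailing "?id=7".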

I have run a crawl with this value:

<regex>
 <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
 <substitution>$1$3</substitution>
</regex>

added to my regex-normalizer.xml.

However, it did not change the db. Does bin/nutch crawl need a special 
command to get this to run?