Posted to user@nutch.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2018/10/12 22:38:02 UTC

index-replace: variable substitution?

Greetings,

I'm using the binary release of Nutch v1.15 with Solr v7.3.1, and
index-replace to copy a substring of the 'url' field to a new 'site'
field.  Here is the definition in my nutch-site.xml:

<property>
    <name>index.replace.regexp</name>
    <value>
       urlmatch=.*www.mydomain.ca.*
       url:site=/.*www.mydomain.ca.*/www/

       urlmatch=.*foo.mydomain.ca.*
       url:site=/.*foo.mydomain.ca.*/foo/

       urlmatch=.*bar.mydomain.ca.*
       url:site=/.*bar.mydomain.ca.*/bar/
    </value>
</property>

This works as expected.  I am given the following site values for the
given url values:

url: https://www.mydomain.ca/test/path -> site: www
url: http://foo.mydomain.ca/some/other/path -> site: foo
url: https://bar.mydomain.ca/another/example -> site: bar

However, this means I need a separate definition for every host or
subdomain I am crawling (i.e. www, foo, bar).  Can I use variable
substitution in index-replace, or is there another way to do this
automatically?

regards,
Ryan

Re: index-replace: variable substitution?

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Hi Yossi,

Thank you.  I finally got it to work using this configuration:

<property>
    <name>index.replace.regexp</name>
    <value>
       url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/
    </value>
</property>

cheers,
Ryan
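
For anyone who wants to check a pattern like this outside of Nutch, here
is a minimal Java sketch that runs the same regex and the "$1"
replacement through Matcher.replaceAll. The class and method names
(SiteFieldDemo, site, siteStrict) are made up for illustration and are
not part of index-replace. The second pattern is only a suggested
tightening that escapes the dots in the host name. The ".." after the
colon is kept from the configuration above, on the assumption that it
stands in for "//" because "/" is the delimiter in the config value.

```java
import java.util.regex.Pattern;

public class SiteFieldDemo {
    // Pattern copied from the configuration above (dots match any character).
    static final Pattern LAX =
        Pattern.compile("https?:..([a-zA-Z0-9]+).mydomain.ca.*");

    // Hypothetical stricter variant: the dots in the host name are escaped
    // so that they only match a literal '.'.
    static final Pattern STRICT =
        Pattern.compile("https?:..([a-zA-Z0-9]+)\\.mydomain\\.ca.*");

    // Replace the whole match with the first capturing group.
    static String site(String url) {
        return LAX.matcher(url).replaceAll("$1");
    }

    static String siteStrict(String url) {
        return STRICT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.mydomain.ca/test/path",
            "http://foo.mydomain.ca/some/other/path",
            "https://bar.mydomain.ca/another/example"
        };
        for (String url : urls) {
            System.out.println("url: " + url + " -> site: " + site(url));
        }
    }
}
```

Note that in plain Java, replaceAll returns the input unchanged when the
pattern does not match at all, so a URL from another domain would come
back as the full URL rather than as an empty string.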

On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
> 
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
> 
> Yossi.
> 

RE: index-replace: variable substitution?

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Ryan,

From looking at the code of index-replace, it uses Java's Matcher.replaceAll <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->, so $1 (for example) should work.

Yossi.
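
To illustrate what group references in Matcher.replaceAll look like in
general, here is a small standalone Java sketch. It is independent of
Nutch, and the class and method names (GroupReferenceDemo, swap,
literalDollar) are invented for the example. It also shows
Matcher.quoteReplacement, which is the standard way to get a literal
"$1" into the output instead of a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupReferenceDemo {
    // Two capturing groups; $1 and $2 in the replacement refer back to them.
    static String swap(String input) {
        Pattern p = Pattern.compile("(\\w+)@(\\w+)");
        return p.matcher(input).replaceAll("$2:$1");
    }

    // quoteReplacement escapes '$' and '\' so the text is inserted literally.
    static String literalDollar(String input) {
        Pattern p = Pattern.compile("price");
        return p.matcher(input).replaceAll(Matcher.quoteReplacement("$1"));
    }

    public static void main(String[] args) {
        System.out.println(swap("user@host"));      // group order reversed
        System.out.println(literalDollar("price")); // literal dollar sign kept
    }
}
```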

 
