Posted to user@nutch.apache.org by Ryan Suarez <ry...@sheridancollege.ca> on 2018/10/12 22:38:02 UTC
index-replace: variable substitution?
Greetings,
I'm using the Nutch v1.15 binaries with Solr v7.3.1, and the
index-replace plugin to copy a substring of the 'url' field to a new
'site' field. Here is the definition in my nutch-site.xml:
<property>
<name>index.replace.regexp</name>
<value>
urlmatch=.*www.mydomain.ca.*
url:site=/.*www.mydomain.ca.*/www/
urlmatch=.*foo.mydomain.ca.*
url:site=/.*foo.mydomain.ca.*/foo/
urlmatch=.*bar.mydomain.ca.*
url:site=/.*bar.mydomain.ca.*/bar/
</value>
</property>
This works as expected. I get the following site values for these
url values:
url: https://www.mydomain.ca/test/path -> site: www
url: http://foo.mydomain.ca/some/other/path -> site: foo
url: https://bar.mydomain.ca/another/example -> site: bar
However, this means I need a separate definition for every host or
subdomain I am crawling (i.e., www, foo, bar). Can I use variable
substitution in index-replace, or is there another way to do this
automatically?
regards,
Ryan
Re: index-replace: variable substitution?
Posted by Ryan Suarez <ry...@sheridancollege.ca>.
Hi Yossi,
Thank you. I finally got it to work using this configuration:
<property>
<name>index.replace.regexp</name>
<value>
url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/
</value>
</property>
cheers,
Ryan
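
For readers who want to check the pattern outside of a crawl, here is a minimal sketch. It assumes index-replace hands the pattern and the replacement to Java's Matcher.replaceAll unchanged, as Yossi's reading of the code suggests. The class name SiteFieldSketch and the helper extractSite are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

class SiteFieldSketch {
    // Pattern copied from the configuration above. The unescaped dots
    // match any character, which is how ".." covers the "//".
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?:..([a-zA-Z0-9]+).mydomain.ca.*");

    // Hypothetical helper: "$1" in the replacement refers to the first
    // capture group, i.e. the host label in front of ".mydomain.ca".
    static String extractSite(String url) {
        return URL_PATTERN.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.mydomain.ca/test/path",
            "http://foo.mydomain.ca/some/other/path",
            "https://bar.mydomain.ca/another/example"
        };
        for (String url : urls) {
            // Prints www, foo and bar respectively.
            System.out.println(url + " -> site: " + extractSite(url));
        }
    }
}
```

Because the pattern consumes the whole URL, the replacement leaves only the captured label.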
On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
>
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
>
> Yossi.
RE: index-replace: variable substitution?
Posted by Yossi Tamari <yo...@pipl.com>.
Hi Ryan,
From looking at the code of index-replace, it uses Java's Matcher.replaceAll <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->, so $1 (for example) should work.
Yossi.
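
As a rough illustration of the Matcher.replaceAll behaviour described here, the sketch below shows two things in plain Java: "$1" expands to the first capture group, and input that the pattern does not match comes back unchanged. The class GroupReferenceSketch, the helper rewrite, and the escaped-dot pattern are assumptions made for this example. Whether index-replace still writes a field value when nothing matches is not covered here.

```java
import java.util.regex.Pattern;

class GroupReferenceSketch {
    // Hypothetical pattern: "//" is spelled out and the literal dots are
    // escaped, so only a real "." separates the host label from the domain.
    private static final Pattern HOST =
        Pattern.compile("https?://([a-zA-Z0-9]+)\\.mydomain\\.ca.*");

    // Hypothetical helper wrapping Matcher.replaceAll with a "$1" replacement.
    static String rewrite(String url) {
        return HOST.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The pattern matches the whole URL, so only group 1 remains.
        System.out.println(rewrite("https://www.mydomain.ca/test/path"));
        // No match ("X" is not a dot), so the input is returned as-is.
        System.out.println(rewrite("https://wwwXmydomain.ca/test/path"));
    }
}
```

The first call prints www. The second prints the URL unchanged, since replaceAll only rewrites the portions of the input that the pattern matches.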