Posted to user@nutch.apache.org by Yongyao Jiang <j....@gmail.com> on 2017/07/27 16:09:37 UTC
Accept language and url filter not working
Hi all,
I am having some issues with the "http.accept.language" property and the
"urlfilter-regex" plugin. My goal is to collect only English webpages
and to exclude all "wikipedia" pages.
1. I have added the following to nutch-site.xml, but the results
still contain lots of pages in "zh", "ca", "fr", etc. To be safe, I also
made the same change in nutch-default.xml. I wonder whether I need to add a
plugin in nutch-site.xml for this to work.
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national
group.
</description>
</property>
2. With respect to "urlfilter-regex", I have added the following
configuration to nutch-site.xml and regex-urlfilter.txt, respectively.
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(tika)|index-(anchor|basic|more|static|replace|links)|indexer-elastic|urlnormalizer-basic|scoring-(opic|similarity)|language-identifier|protocol-httpclient</value>
</property>
-^.*wikipedia.*$
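One thing worth checking for the rule above is where it sits in regex-urlfilter.txt. The file is evaluated top to bottom, the first matching rule decides, and the stock file ends with a catch-all "+." line, so an exclusion placed after that line is never reached. The thread does not say where the rule was placed, so this is only a possible cause. Below is a minimal sketch in Python of that first-match-wins behaviour. It is a simplified stand-in, not Nutch's own code; the names parse_rules and accepts and the sample URL are hypothetical.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Exclusion listed before the catch-all: the wikipedia rule is reached first.
EXCLUDE_FIRST = """
# skip anything containing "wikipedia"
-^.*wikipedia.*$
# accept everything else
+.
"""

# Exclusion listed after the catch-all: "+." matches first and wins.
ACCEPT_FIRST = """
+.
-^.*wikipedia.*$
"""

url = "https://en.wikipedia.org/wiki/Web_crawler"  # hypothetical sample URL
print(accepts(parse_rules(EXCLUDE_FIRST), url))  # False: URL is filtered out
print(accepts(parse_rules(ACCEPT_FIRST), url))   # True: exclusion never applies
```

If the exclusion already comes before the catch-all, the ordering is not the problem and the cause lies elsewhere.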
Thanks,
Yongyao
--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
Re: Accept language and url filter not working
Posted by Yongyao Jiang <j....@gmail.com>.
Thanks, Markus. Actually, I don't need any Wikipedia pages at all, even the
English ones, so I don't think urlfilter-domain will work for this.
Yongyao
On Thu, Jul 27, 2017 at 12:18 PM, Markus Jelsma <ma...@openindex.io>
wrote:
> Hello,
>
> If en.wikipedia.org is all you are after, enable urlfilter-domain and add
> the hostname to the domain-urlfilter.txt file; all non-English
> hyperlinks are then discarded.
>
> Regards,
> Markus
RE: Accept language and url filter not working
Posted by Markus Jelsma <ma...@openindex.io>.
Hello,
If en.wikipedia.org is all you are after, enable urlfilter-domain and add the hostname to the domain-urlfilter.txt file; all non-English hyperlinks are then discarded.
Regards,
Markus