Posted to user@nutch.apache.org by "Ralf R. Kotowski" <rr...@enlle.com> on 2013/11/02 18:15:25 UTC

Language identification

Hi,

 

What is the correct process to only store documents in a desired language?

 

I'm currently doing this:

 

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

 

I'm using a seed.txt with URLs I know are in the language I want, but as the
crawl grows I seem to be getting more and more documents in other
languages.

 

 

Thanks in advance

Re: Language identification

Posted by Gavin Engel <ga...@engel.com>.

I'm sorry to bother you all with this question.  I'm having trouble
unsubscribing from this mailing list.  Would anyone tell me how to do this,
please?



RE: Language identification

Posted by Markus Jelsma <ma...@openindex.io>.

If you have no experience coding for Nutch and want a quick win, and already have the lang field in your parse metadata, you can hack into trunk's index-more filter and reject other languages there. 
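
A minimal sketch of the rejection step described above, written in Python purely to illustrate the logic. It is not Nutch code: `ALLOWED_LANGS`, `primary_lang`, and `filter_document` are hypothetical names, and the document is modelled as a plain dict with a `lang` entry. In Nutch itself the equivalent check would live in a Java indexing filter, and the assumption here is that returning no document from such a filter drops the page from the index.

```python
# Hypothetical sketch: not Nutch code. All names here are illustrative.
from typing import Optional

ALLOWED_LANGS = {"ja"}  # assumption: keep only Japanese documents


def primary_lang(tag: str) -> str:
    """Reduce a tag such as 'ja-JP' or 'en_US' to its primary subtag ('ja', 'en')."""
    return tag.strip().lower().replace("_", "-").split("-")[0]


def filter_document(doc: dict) -> Optional[dict]:
    """Return the document unchanged if its 'lang' field is allowed, else None.

    Documents with no 'lang' field are rejected as well; keeping them
    instead is a policy choice.
    """
    lang = doc.get("lang")
    if not lang or primary_lang(lang) not in ALLOWED_LANGS:
        return None
    return doc


docs = [
    {"url": "http://example.jp/", "lang": "ja"},
    {"url": "http://example.com/", "lang": "en"},
    {"url": "http://example.org/", "lang": "ja-JP"},
    {"url": "http://example.net/"},  # no language detected
]
kept = [d["url"] for d in docs if filter_document(d) is not None]
print(kept)
```

Note that a check like this only keeps pages out of the index. The crawler still fetches and parses them, so it does not reduce crawl volume on its own.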
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 14:26
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> OK, when I do this on the SVN trunk I get:
> 
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 < language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFilter.java
> patching file src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguageFilter.java
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> These are git patches and work differently than what we are used to at the ASF
> (they carry a/ and b/ path prefixes).
> From Nutch's root directory, apply them with patch -p1 < patchfile; use -p0 for
> the usual SVN-based patches.
> 
>  
>  
> -----Original message-----
> > From:Ralf R. Kotowski <rr...@enlle.com>
> > Sent: Tuesday 5th November 2013 13:12
> > To: user@nutch.apache.org
> > Subject: RE: Language identification
> > 
> > Thank you,
> > 
> > I'm still learning how to patch Nutch... not much luck so far...
> > 
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: user@nutch.apache.org
> > Subject: Re: Language identification
> > 
> > Hi Ralf,
> > 
> > I patched a language-filter plugin that filters out or accepts pages in the
> > specified languages during the parse phase.
> > 
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> > 
> > 
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting that
> > > they should provide you the content in the languages you specified, but that
> > > does not give you any guarantees, nor does it allow you to filter the content.
> > > Look at the languageidentifier plugin as a starting point; then you could add a
> > > custom mapreduce job to remove the pages which are not in the languages of
> > > interest.
> > >
> > > HTH
> > >
> > > Julien