Posted to user@nutch.apache.org by "Ralf R. Kotowski" <rr...@enlle.com> on 2013/11/02 18:15:25 UTC

Language identification

Hi,

 

What is the correct process to only store documents in a desired language?

 

I'm currently doing this:

 

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

 

I'm using a seed.txt with URLs that I know are in the language I want, but as
the crawl grows I seem to be getting more and more docs in other
languages.

 

 

Thnx in advance


Re: Language identification

Posted by Gavin Engel <ga...@engel.com>.
I'm sorry to bother you all with this question.  I'm having trouble
unsubscribing from this mailing list.  Would anyone tell me how to do this,
please?


On Sat, Nov 2, 2013 at 1:15 PM, Ralf R. Kotowski <rr...@enlle.com> wrote:

> Hi,
>
>
>
> What is the correct process to only store documents in a desired language?
>
>
>
> I'm currently doing this:
>
>
>
> <property>
> <name>http.accept.language</name>
> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> <description>Value of the "Accept-Language" request header field.
> This allows selecting non-English language as default one to retrieve.
> It is a useful setting for search engines build for certain national group.
> </description>
> </property>
>
>
>
> Using a seed.txt with URL's I know are in the language I want, but as the
> crawl grows it seems I'm starting to get more and more docs in other
> languages.
>
>
>
>
>
> Thnx in advance
>
>

RE: Language identification

Posted by Markus Jelsma <ma...@openindex.io>.
If you have no experience coding for Nutch and want a quick win, and already have the lang field in your parse metadata, you can hack into trunk's index-more filter and reject other languages there. 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 14:26
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> OK, when I do this on the SVN trunk I get:
> 
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 <
> language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file
> src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFil
> ter.java
> patching file
> src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguag
> eFilter.java
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> These are git patches and work differently then we are used to at the ASF
> (a/ and b/ prefixes).
> In Nutch' root, patch -p1 < patchfile or -p0 for the usual SVN based
> patches.
> 
>  
>  
> -----Original message-----
> > From:Ralf R. Kotowski <rr...@enlle.com>
> > Sent: Tuesday 5th November 2013 13:12
> > To: user@nutch.apache.org
> > Subject: RE: Language identification
> > 
> > Thank you,
> > 
> > I'm still learning ow to patch nutch... not much luck so far...
> > 
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: user@nutch.apache.org
> > Subject: Re: Language identification
> > 
> > Hi Ralf,
> > 
> > I patched language-filter plugin for filter or accept pages which 
> > specified languages while parse phase.
> > 
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> > 
> > 
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting
> that
> > > they should provide you the content in the languages you specified but
> > that
> > > does not give you any guarantees nor allows you to filter the content.
> > Look
> > > at the languageidentifier plugin as a starting point, then you could add
> a
> > > custom mapreduce job to remove the pages which are not in the languages
> of
> > > interest.
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > >
> > > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> > >
> > >> Hi,
> > >>
> > >>
> > >>
> > >> What is the correct process to only store documents in a desired
> > language?
> > >>
> > >>
> > >>
> > >> I'm currently doing this:
> > >>
> > >>
> > >>
> > >> <property>
> > >> <name>http.accept.language</name>
> > >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> > >> <description>Value of the "Accept-Language" request header field.
> > >> This allows selecting non-English language as default one to retrieve.
> > >> It is a useful setting for search engines build for certain national
> > group.
> > >> </description>
> > >> </property>
> > >>
> > >>
> > >>
> > >> Using a seed.txt with URL's I know are in the language I want, but as
> the
> > >> crawl grows it seems I'm starting to get more and more docs in other
> > >> languages.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Thnx in advance
> > >>
> > >>
> > >
> > 
> > 
> > 
> 
> 

RE: Language identification

Posted by Markus Jelsma <ma...@openindex.io>.
If that's a public dependency, you can look it up at maven.org and get the Ivy config line for it. Add that line to the plugin's ivy.xml and you're good to go.

 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 22:33
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> I get following error in the logs:
> 
> WARN  plugin.PluginRepository - Missing dependency
> language-identifier-agmlab for plugin language-filter
> 
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
> 
> Hi Ralf,
> 
> I patched language-filter plugin for filter or accept pages which 
> specified languages while parse phase.
> 
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> 
> 
> On 02-11-2013 22:05, Julien Nioche wrote:
> > Ralf,
> >
> > The parameter http.accept.language tells the servers you are hitting that
> > they should provide you the content in the languages you specified but
> that
> > does not give you any guarantees nor allows you to filter the content.
> Look
> > at the languageidentifier plugin as a starting point, then you could add a
> > custom mapreduce job to remove the pages which are not in the languages of
> > interest.
> >
> > HTH
> >
> > Julien
> >
> >
> >
> > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> What is the correct process to only store documents in a desired
> language?
> >>
> >>
> >>
> >> I'm currently doing this:
> >>
> >>
> >>
> >> <property>
> >> <name>http.accept.language</name>
> >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >> <description>Value of the "Accept-Language" request header field.
> >> This allows selecting non-English language as default one to retrieve.
> >> It is a useful setting for search engines build for certain national
> group.
> >> </description>
> >> </property>
> >>
> >>
> >>
> >> Using a seed.txt with URL's I know are in the language I want, but as the
> >> crawl grows it seems I'm starting to get more and more docs in other
> >> languages.
> >>
> >>
> >>
> >>
> >>
> >> Thnx in advance
> >>
> >>
> >
> 
> 
> 

RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
Thnx, -p1 made the difference.. heh

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Tuesday, November 05, 2013 1:17 PM
To: user@nutch.apache.org
Subject: RE: Language identification

These are git patches and work differently then we are used to at the ASF
(a/ and b/ prefixes).
In Nutch' root, patch -p1 < patchfile or -p0 for the usual SVN based
patches.

 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 13:12
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> Thank you,
> 
> I'm still learning ow to patch nutch... not much luck so far...
> 
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
> 
> Hi Ralf,
> 
> I patched language-filter plugin for filter or accept pages which 
> specified languages while parse phase.
> 
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> 
> 
> On 02-11-2013 22:05, Julien Nioche wrote:
> > Ralf,
> >
> > The parameter http.accept.language tells the servers you are hitting
that
> > they should provide you the content in the languages you specified but
> that
> > does not give you any guarantees nor allows you to filter the content.
> Look
> > at the languageidentifier plugin as a starting point, then you could add
a
> > custom mapreduce job to remove the pages which are not in the languages
of
> > interest.
> >
> > HTH
> >
> > Julien
> >
> >
> >
> > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> What is the correct process to only store documents in a desired
> language?
> >>
> >>
> >>
> >> I'm currently doing this:
> >>
> >>
> >>
> >> <property>
> >> <name>http.accept.language</name>
> >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >> <description>Value of the "Accept-Language" request header field.
> >> This allows selecting non-English language as default one to retrieve.
> >> It is a useful setting for search engines build for certain national
> group.
> >> </description>
> >> </property>
> >>
> >>
> >>
> >> Using a seed.txt with URL's I know are in the language I want, but as
the
> >> crawl grows it seems I'm starting to get more and more docs in other
> >> languages.
> >>
> >>
> >>
> >>
> >>
> >> Thnx in advance
> >>
> >>
> >
> 
> 
> 


RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
I patched it also against the tarball...

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Tuesday, November 05, 2013 2:30 PM
To: user@nutch.apache.org
Subject: RE: Language identification

Ah, that patch is for the 2.x branch and it won't work on trunk but it can
be ported with relative ease but it'll take some time. 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 14:26
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> OK, when I do this on the SVN trunk I get:
> 
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 <
> language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file
>
src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFil
> ter.java
> patching file
>
src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguag
> eFilter.java
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> These are git patches and work differently then we are used to at the ASF
> (a/ and b/ prefixes).
> In Nutch' root, patch -p1 < patchfile or -p0 for the usual SVN based
> patches.
> 
>  
>  
> -----Original message-----
> > From:Ralf R. Kotowski <rr...@enlle.com>
> > Sent: Tuesday 5th November 2013 13:12
> > To: user@nutch.apache.org
> > Subject: RE: Language identification
> > 
> > Thank you,
> > 
> > I'm still learning ow to patch nutch... not much luck so far...
> > 
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: user@nutch.apache.org
> > Subject: Re: Language identification
> > 
> > Hi Ralf,
> > 
> > I patched language-filter plugin for filter or accept pages which 
> > specified languages while parse phase.
> > 
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> > 
> > 
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting
> that
> > > they should provide you the content in the languages you specified but
> > that
> > > does not give you any guarantees nor allows you to filter the content.
> > Look
> > > at the languageidentifier plugin as a starting point, then you could
add
> a
> > > custom mapreduce job to remove the pages which are not in the
languages
> of
> > > interest.
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > >
> > > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> > >
> > >> Hi,
> > >>
> > >>
> > >>
> > >> What is the correct process to only store documents in a desired
> > language?
> > >>
> > >>
> > >>
> > >> I'm currently doing this:
> > >>
> > >>
> > >>
> > >> <property>
> > >> <name>http.accept.language</name>
> > >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> > >> <description>Value of the "Accept-Language" request header field.
> > >> This allows selecting non-English language as default one to
retrieve.
> > >> It is a useful setting for search engines build for certain national
> > group.
> > >> </description>
> > >> </property>
> > >>
> > >>
> > >>
> > >> Using a seed.txt with URL's I know are in the language I want, but as
> the
> > >> crawl grows it seems I'm starting to get more and more docs in other
> > >> languages.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Thnx in advance
> > >>
> > >>
> > >
> > 
> > 
> > 
> 
> 


RE: Language identification

Posted by Markus Jelsma <ma...@openindex.io>.
Ah, that patch is for the 2.x branch and won't work on trunk. It can be ported with relative ease, but that will take some time.
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 14:26
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> OK, when I do this on the SVN trunk I get:
> 
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 <
> language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file
> src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFil
> ter.java
> patching file
> src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguag
> eFilter.java
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> These are git patches and work differently then we are used to at the ASF
> (a/ and b/ prefixes).
> In Nutch' root, patch -p1 < patchfile or -p0 for the usual SVN based
> patches.
> 
>  
>  
> -----Original message-----
> > From:Ralf R. Kotowski <rr...@enlle.com>
> > Sent: Tuesday 5th November 2013 13:12
> > To: user@nutch.apache.org
> > Subject: RE: Language identification
> > 
> > Thank you,
> > 
> > I'm still learning ow to patch nutch... not much luck so far...
> > 
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: user@nutch.apache.org
> > Subject: Re: Language identification
> > 
> > Hi Ralf,
> > 
> > I patched language-filter plugin for filter or accept pages which 
> > specified languages while parse phase.
> > 
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> > 
> > 
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting
> that
> > > they should provide you the content in the languages you specified but
> > that
> > > does not give you any guarantees nor allows you to filter the content.
> > Look
> > > at the languageidentifier plugin as a starting point, then you could add
> a
> > > custom mapreduce job to remove the pages which are not in the languages
> of
> > > interest.
> > >
> > > HTH
> > >
> > > Julien
> > >
> > >
> > >
> > > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> > >
> > >> Hi,
> > >>
> > >>
> > >>
> > >> What is the correct process to only store documents in a desired
> > language?
> > >>
> > >>
> > >>
> > >> I'm currently doing this:
> > >>
> > >>
> > >>
> > >> <property>
> > >> <name>http.accept.language</name>
> > >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> > >> <description>Value of the "Accept-Language" request header field.
> > >> This allows selecting non-English language as default one to retrieve.
> > >> It is a useful setting for search engines build for certain national
> > group.
> > >> </description>
> > >> </property>
> > >>
> > >>
> > >>
> > >> Using a seed.txt with URL's I know are in the language I want, but as
> the
> > >> crawl grows it seems I'm starting to get more and more docs in other
> > >> languages.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Thnx in advance
> > >>
> > >>
> > >
> > 
> > 
> > 
> 
> 

RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
OK, when I do this on the SVN trunk I get:

blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 < language-filter.patch
patching file conf/nutch-default.xml
Hunk #1 succeeded at 941 (offset 19 lines).
patching file ivy/ivy.xml
Hunk #1 FAILED at 111.
1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
patching file src/plugin/build.xml
Hunk #1 succeeded at 30 with fuzz 1.
Hunk #2 succeeded at 79 with fuzz 1.
Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
patching file src/plugin/language-filter/build.xml
patching file src/plugin/language-filter/ivy.xml
patching file src/plugin/language-filter/plugin.xml
patching file src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFilter.java
patching file src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguageFilter.java

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Tuesday, November 05, 2013 1:17 PM
To: user@nutch.apache.org
Subject: RE: Language identification

These are git patches and work differently then we are used to at the ASF
(a/ and b/ prefixes).
In Nutch' root, patch -p1 < patchfile or -p0 for the usual SVN based
patches.

 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 13:12
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> Thank you,
> 
> I'm still learning ow to patch nutch... not much luck so far...
> 
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
> 
> Hi Ralf,
> 
> I patched language-filter plugin for filter or accept pages which 
> specified languages while parse phase.
> 
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> 
> 
> On 02-11-2013 22:05, Julien Nioche wrote:
> > Ralf,
> >
> > The parameter http.accept.language tells the servers you are hitting
that
> > they should provide you the content in the languages you specified but
> that
> > does not give you any guarantees nor allows you to filter the content.
> Look
> > at the languageidentifier plugin as a starting point, then you could add
a
> > custom mapreduce job to remove the pages which are not in the languages
of
> > interest.
> >
> > HTH
> >
> > Julien
> >
> >
> >
> > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> What is the correct process to only store documents in a desired
> language?
> >>
> >>
> >>
> >> I'm currently doing this:
> >>
> >>
> >>
> >> <property>
> >> <name>http.accept.language</name>
> >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >> <description>Value of the "Accept-Language" request header field.
> >> This allows selecting non-English language as default one to retrieve.
> >> It is a useful setting for search engines build for certain national
> group.
> >> </description>
> >> </property>
> >>
> >>
> >>
> >> Using a seed.txt with URL's I know are in the language I want, but as
the
> >> crawl grows it seems I'm starting to get more and more docs in other
> >> languages.
> >>
> >>
> >>
> >>
> >>
> >> Thnx in advance
> >>
> >>
> >
> 
> 
> 


RE: Language identification

Posted by Markus Jelsma <ma...@openindex.io>.
These are git patches, and they work differently than what we are used to at the ASF (the file paths carry a/ and b/ prefixes).
From Nutch's root, use patch -p1 < patchfile for these, or -p0 for the usual SVN-based patches.
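
To illustrate the difference between -p1 and -p0: the number tells patch how many leading path components to strip from the file names recorded in the diff. The following is only a rough Python sketch of that rule. The helper name strip_components is made up for illustration and is not part of patch or Nutch.

```python
def strip_components(path: str, count: int) -> str:
    """Drop the first `count` slash-separated components of a diff path.

    This mimics, in simplified form, what `patch -pN` does to the file
    names it reads from a patch file.
    """
    parts = path.split("/")
    if count < 0 or count >= len(parts):
        raise ValueError(f"cannot strip {count} components from {path!r}")
    return "/".join(parts[count:])


# A git-style diff names the file with an a/ (or b/) prefix, so one
# component has to be stripped to get a path relative to the project root.
print(strip_components("a/conf/nutch-default.xml", 1))

# An SVN-style diff already uses root-relative paths, so nothing is stripped.
print(strip_components("conf/nutch-default.xml", 0))
```

Both calls print conf/nutch-default.xml, which is the path patch needs when it is run from the project root.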

 
 
-----Original message-----
> From:Ralf R. Kotowski <rr...@enlle.com>
> Sent: Tuesday 5th November 2013 13:12
> To: user@nutch.apache.org
> Subject: RE: Language identification
> 
> Thank you,
> 
> I'm still learning ow to patch nutch... not much luck so far...
> 
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
> 
> Hi Ralf,
> 
> I patched language-filter plugin for filter or accept pages which 
> specified languages while parse phase.
> 
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> 
> 
> On 02-11-2013 22:05, Julien Nioche wrote:
> > Ralf,
> >
> > The parameter http.accept.language tells the servers you are hitting that
> > they should provide you the content in the languages you specified but
> that
> > does not give you any guarantees nor allows you to filter the content.
> Look
> > at the languageidentifier plugin as a starting point, then you could add a
> > custom mapreduce job to remove the pages which are not in the languages of
> > interest.
> >
> > HTH
> >
> > Julien
> >
> >
> >
> > On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> What is the correct process to only store documents in a desired
> language?
> >>
> >>
> >>
> >> I'm currently doing this:
> >>
> >>
> >>
> >> <property>
> >> <name>http.accept.language</name>
> >> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >> <description>Value of the "Accept-Language" request header field.
> >> This allows selecting non-English language as default one to retrieve.
> >> It is a useful setting for search engines build for certain national
> group.
> >> </description>
> >> </property>
> >>
> >>
> >>
> >> Using a seed.txt with URL's I know are in the language I want, but as the
> >> crawl grows it seems I'm starting to get more and more docs in other
> >> languages.
> >>
> >>
> >>
> >>
> >>
> >> Thnx in advance
> >>
> >>
> >
> 
> 
> 

RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
Thank you,

I'm still learning how to patch Nutch... not much luck so far...

-----Original Message-----
From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
Sent: Tuesday, November 05, 2013 10:36 AM
To: user@nutch.apache.org
Subject: Re: Language identification

Hi Ralf,

I patched the language-filter plugin to filter or accept pages in
specified languages during the parse phase.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>


On 02-11-2013 22:05, Julien Nioche wrote:
> Ralf,
>
> The parameter http.accept.language tells the servers you are hitting that
> they should provide you the content in the languages you specified but
that
> does not give you any guarantees nor allows you to filter the content.
Look
> at the languageidentifier plugin as a starting point, then you could add a
> custom mapreduce job to remove the pages which are not in the languages of
> interest.
>
> HTH
>
> Julien
>
>
>
> On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
>
>> Hi,
>>
>>
>>
>> What is the correct process to only store documents in a desired
language?
>>
>>
>>
>> I'm currently doing this:
>>
>>
>>
>> <property>
>> <name>http.accept.language</name>
>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>> <description>Value of the "Accept-Language" request header field.
>> This allows selecting non-English language as default one to retrieve.
>> It is a useful setting for search engines build for certain national
group.
>> </description>
>> </property>
>>
>>
>>
>> Using a seed.txt with URL's I know are in the language I want, but as the
>> crawl grows it seems I'm starting to get more and more docs in other
>> languages.
>>
>>
>>
>>
>>
>> Thnx in advance
>>
>>
>



Re: Language identification

Posted by ilhami Kalkan <il...@agmlab.com>.
Yes.


On 08-11-2013 17:41, Ralf R. Kotowski wrote:
> We are talking about this plug-in, correct?
>
>
> http://wiki.apache.org/nutch/LanguageIdentifierPlugin
>
>
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com]
> Sent: Thursday, November 07, 2013 10:29 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
>
> Hi Ralf,
>
> Short answer is no.
> This plugin runs after the language-identifier plugin, because the
> language-identifier plugin marks the language in the metadata and this
> plugin reads that value to filter or accept pages during the parse phase.
> The language-identifier plugin gets the lang value from the header or
> decides it from the n-grams of the page content.
> The language-filter plugin reads the "language.filter.languages" entries,
> which must be ISO-639 language codes, and matches them against the metadata
> lang. Page languages like en-us were rejected. Thanks for the heads-up. I
> added the necessary check to the patch to prevent this case.
>
>
> On 06-11-2013 23:52, Ralf R. Kotowski wrote:
>> Hi,
>>
>> I have run several passes. I no longer get the bulk of foreign-language
>> sites I used to, but I also don't get some others that I am supposed to.
>>
>> Does this plug-in work through the HTML header? Because I got one of the
>> ones that are not supposed to be there with this header:
>>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
>>
>> -----Original Message-----
>> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com]
>> Sent: Wednesday, November 06, 2013 9:08 AM
>> To: user@nutch.apache.org
>> Subject: Re: Language identification
>>
>> Hi Ralf,
>>
>> language-identifier-agmlab is my test plugin name. I fixed the patch.
>>
>> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>>
>> On 06-11-2013 00:50, Ralf R. Kotowski wrote:
>>> I get following error in the logs:
>>>
>>> WARN  plugin.PluginRepository - Missing dependency
>>> language-identifier-agmlab for plugin language-filter
>>>
>>> -----Original Message-----
>>> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com]
>>> Sent: Tuesday, November 05, 2013 10:36 AM
>>> To: user@nutch.apache.org
>>> Subject: Re: Language identification
>>>
>>> Hi Ralf,
>>>
>>> I patched language-filter plugin for filter or accept pages which
>>> specified languages while parse phase.
>>>
>>> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>>>
>>>
>>> On 02-11-2013 22:05, Julien Nioche wrote:
>>>> Ralf,
>>>>
>>>> The parameter http.accept.language tells the servers you are hitting
> that
>>>> they should provide you the content in the languages you specified but
>>> that
>>>> does not give you any guarantees nor allows you to filter the content.
>>> Look
>>>> at the languageidentifier plugin as a starting point, then you could add
>> a
>>>> custom mapreduce job to remove the pages which are not in the languages
>> of
>>>> interest.
>>>>
>>>> HTH
>>>>
>>>> Julien
>>>>
>>>>
>>>>
>>>> On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> What is the correct process to only store documents in a desired
>>> language?
>>>>> I'm currently doing this:
>>>>>
>>>>>
>>>>>
>>>>> <property>
>>>>> <name>http.accept.language</name>
>>>>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>>>> <description>Value of the "Accept-Language" request header field.
>>>>> This allows selecting non-English language as default one to retrieve.
>>>>> It is a useful setting for search engines build for certain national
>>> group.
>>>>> </description>
>>>>> </property>
>>>>>
>>>>>
>>>>>
>>>>> Using a seed.txt with URL's I know are in the language I want, but as
>> the
>>>>> crawl grows it seems I'm starting to get more and more docs in other
>>>>> languages.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thnx in advance
>>>>>
>>>>>
>


RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
We are talking about this plug-in, correct?


http://wiki.apache.org/nutch/LanguageIdentifierPlugin



-----Original Message-----
From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com] 
Sent: Thursday, November 07, 2013 10:29 AM
To: user@nutch.apache.org
Subject: Re: Language identification

Hi Ralf,

Short answer is no.
This plugin runs after the language-identifier plugin, because the
language-identifier plugin marks the language in the metadata and this
plugin reads that value to filter or accept pages during the parse phase.
The language-identifier plugin gets the lang value from the header or
decides it from the n-grams of the page content.
The language-filter plugin reads the "language.filter.languages" entries,
which must be ISO-639 language codes, and matches them against the metadata
lang. Page languages like en-us were rejected. Thanks for the heads-up. I
added the necessary check to the patch to prevent this case.
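
The en-us case described above can be sketched roughly as follows. This is not the code from the NUTCH-1663 patch (which is a Java plugin); it is only a small Python illustration, under the assumption that the fix amounts to comparing the primary language subtag. The function names primary_subtag and is_accepted are hypothetical.

```python
from typing import Iterable, Optional


def primary_subtag(lang: str) -> str:
    """Reduce a tag such as 'en-us' or 'ja_JP' to its language part ('en', 'ja')."""
    return lang.strip().lower().replace("_", "-").split("-")[0]


def is_accepted(page_lang: Optional[str], allowed: Iterable[str]) -> bool:
    """Return True if the page language matches one of the allowed codes.

    `allowed` stands in for the configured list of ISO-639 codes; pages
    with no detected language are rejected in this sketch.
    """
    if not page_lang:
        return False
    allowed_codes = {primary_subtag(code) for code in allowed}
    return primary_subtag(page_lang) in allowed_codes


allowed = ["en", "ja"]
print(is_accepted("en-us", allowed))  # regional variant of an allowed language
print(is_accepted("de", allowed))     # language not in the allowed list
```

Comparing "en-us" literally against "en" fails, which matches the rejection described above; comparing only the primary subtag accepts it.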


On 06-11-2013 23:52, Ralf R. Kotowski wrote:
> Hi,
>
> I have run several passes. I no longer get the bulk of foreign-language
> sites I used to, but I also don't get some others that I am supposed to.
>
> Does this plug-in work through the HTML header? Because I got one of the
> ones that are not supposed to be there with this header:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com]
> Sent: Wednesday, November 06, 2013 9:08 AM
> To: user@nutch.apache.org
> Subject: Re: Language identification
>
> Hi Ralf,
>
> language-identifier-agmlab is my test plugin name. I fixed the patch.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
> On 06-11-2013 00:50, Ralf R. Kotowski wrote:
>> I get following error in the logs:
>>
>> WARN  plugin.PluginRepository - Missing dependency
>> language-identifier-agmlab for plugin language-filter
>>
>> -----Original Message-----
>> From: ilhami Kalkan [mailto:ilhami.kalkan@agmlab.com]
>> Sent: Tuesday, November 05, 2013 10:36 AM
>> To: user@nutch.apache.org
>> Subject: Re: Language identification
>>
>> Hi Ralf,
>>
>> I patched language-filter plugin for filter or accept pages which
>> specified languages while parse phase.
>>
>> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>>
>>
>> On 02-11-2013 22:05, Julien Nioche wrote:
>>> Ralf,
>>>
>>> The parameter http.accept.language tells the servers you are hitting
that
>>> they should provide you the content in the languages you specified but
>> that
>>> does not give you any guarantees nor allows you to filter the content.
>> Look
>>> at the languageidentifier plugin as a starting point, then you could add
> a
>>> custom mapreduce job to remove the pages which are not in the languages
> of
>>> interest.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>>
>>>
>>> On 2 November 2013 17:15, Ralf R. Kotowski <rr...@enlle.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> What is the correct process to only store documents in a desired
>> language?
>>>>
>>>> I'm currently doing this:
>>>>
>>>>
>>>>
>>>> <property>
>>>> <name>http.accept.language</name>
>>>> <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
>>>> <description>Value of the "Accept-Language" request header field.
>>>> This allows selecting non-English language as default one to retrieve.
>>>> It is a useful setting for search engines build for certain national
>> group.
>>>> </description>
>>>> </property>
>>>>
>>>>
>>>>
>>>> Using a seed.txt with URL's I know are in the language I want, but as
> the
>>>> crawl grows it seems I'm starting to get more and more docs in other
>>>> languages.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thnx in advance
>>>>
>>>>
>



Re: Language identification

Posted by ilhami Kalkan <il...@agmlab.com>.
Hi Ralf,

The short answer is no.
This plugin runs after the language-identifier plugin, because the 
language-identifier plugin records the language in the metadata and this 
plugin reads that value to filter or accept pages during the parse phase. 
The language-identifier plugin takes the lang value from the header or 
decides it from n-grams of the page content.
The language-filter plugin reads the "language.filter.languages" entries, 
which must be ISO-639 language codes, and matches them against the 
metadata lang. Page languages like en-us were rejected. Thanks for the 
heads-up. I added the necessary check to the patch to prevent this case.
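
The en-us case described above can be sketched in plain Java. This is a minimal illustration only, not the code from the NUTCH-1663 patch: the class LanguageMatchSketch and its methods are hypothetical names, and the assumption is that a page tag such as "en-us" should be reduced to its primary subtag ("en") before it is compared with the ISO-639 codes listed in "language.filter.languages".

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not code from Nutch or NUTCH-1663.
final class LanguageMatchSketch {

  // Reduces a tag such as "en-us" or "en_US" to its primary subtag ("en").
  static String primarySubtag(String tag) {
    if (tag == null) {
      return "";
    }
    String t = tag.trim().toLowerCase(Locale.ROOT);
    for (int i = 0; i < t.length(); i++) {
      char c = t.charAt(i);
      if (c == '-' || c == '_') {
        return t.substring(0, i);
      }
    }
    return t;
  }

  // Parses a comma-separated list of language codes, e.g. "en, es".
  static Set<String> parseAccepted(String csv) {
    Set<String> accepted = new HashSet<>();
    for (String code : csv.split(",")) {
      String normalized = primarySubtag(code);
      if (!normalized.isEmpty()) {
        accepted.add(normalized);
      }
    }
    return accepted;
  }

  // True when the page language, reduced to its primary subtag, is accepted.
  static boolean accepts(Set<String> accepted, String pageLang) {
    String primary = primarySubtag(pageLang);
    return !primary.isEmpty() && accepted.contains(primary);
  }

  public static void main(String[] args) {
    Set<String> accepted = parseAccepted("en,es");
    System.out.println(accepts(accepted, "en-us")); // prints "true"
    System.out.println(accepts(accepted, "ja-jp")); // prints "false"
  }
}
```

Comparing on the primary subtag alone ignores the region, which is probably what a per-language filter wants; a stricter filter could compare the full tag instead.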




RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
Hi,

I have run several passes. I no longer get the bulk of the foreign-language
sites I used to, but some pages that are supposed to be kept are missing too.

Does this plug-in work through the HTML header? I ask because one of the
pages that is not supposed to be there has this header:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us">




RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
Thank you very much.

I'm testing it right now. So far, when I use only this URL as a seed:
http://www.todalaprensa.com/
Nutch retrieves only that page and nothing else. With a larger seed list it
seems to work. I'm currently on the 3rd pass, which is still running; I'll
let you know how it goes.




Re: Language identification

Posted by ilhami Kalkan <il...@agmlab.com>.
Hi Ralf,

language-identifier-agmlab is my test plugin name. I fixed the patch.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>



RE: Language identification

Posted by "Ralf R. Kotowski" <rr...@enlle.com>.
I get the following error in the logs:

WARN  plugin.PluginRepository - Missing dependency language-identifier-agmlab for plugin language-filter




Re: Language identification

Posted by ilhami Kalkan <il...@agmlab.com>.
Hi Ralf,

I patched the language-filter plugin to filter or accept pages in 
specified languages during the parse phase.

NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>




Re: Language identification

Posted by Julien Nioche <li...@gmail.com>.
Ralf,

The http.accept.language parameter tells the servers you are hitting that
they should provide content in the languages you specified, but it gives
you no guarantees and does not let you filter the content. Look at the
languageidentifier plugin as a starting point; you could then add a custom
MapReduce job to remove the pages that are not in the languages of
interest.
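
To make concrete what the header value expresses, here is a small sketch that parses the value from the original question with the JDK's Locale.LanguageRange (Java 8 or later). It only lists the stated preferences and their weights; it does not contact any server, and the class name AcceptLanguageSketch is made up for this illustration.

```java
import java.util.List;
import java.util.Locale;

// Illustration only: shows what the configured header value expresses.
final class AcceptLanguageSketch {

  // Parses an Accept-Language value into ranges ordered by descending weight.
  static List<Locale.LanguageRange> preferences(String headerValue) {
    return Locale.LanguageRange.parse(headerValue);
  }

  public static void main(String[] args) {
    String value = "ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3";
    // Each entry is a preference with a weight, not a restriction.
    for (Locale.LanguageRange range : preferences(value)) {
      System.out.println(range.getRange() + " q=" + range.getWeight());
    }
  }
}
```

The "*" entry means any other language is still acceptable at a lower weight, and a server is free to ignore the header altogether, so the filtering has to happen on the crawler side.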

HTH

Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble