Posted to user@nutch.apache.org by David Jashi <da...@jashi.ge> on 2009/09/29 16:59:52 UTC

Multilanguage support in Nutch 1.0

Hello, all.

I'm having a bit of trouble with Nutch 1.0 and multilanguage support:

I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
analysis-de (German) and analysis-ge (Georgian).
Here are the contents of my seed file:
-----------------------
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
-----------------------
The first is Georgian, the other is German. When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anything calling the analysis
plug-ins, even though hadoop.log clearly states that they are
on and active:
-----------------------
2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)

!!!!!!!!!!!!!!!!!!!!!!!!!
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
Analysis Plug-in (analysis-de)
!!!!!!!!!!!!!!!!!!!!!!!!!

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         JavaScript
Parser (parse-js)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
-----------------------

At the same time:

-----------------------
2009-09-29 16:39:54,406 INFO  lang.LanguageIdentifier - Language
identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO  lang.LanguageIdentifier - Language
identifier plugin supports: it(1000) is(1000) hu(1000) th(1000)
sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
nl(1000)
-----------------------

At the same time, the language identifier works like a charm:
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/test.html
text was identified as ge
-----------------------
$ bin/nutch plugin language-identifier org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl http://212.72.133.54/l/de.html
text was identified as de
-----------------------

What could have possibly gone wrong?

Respectfully,
David Jashi

Re: Multilanguage support in Nutch 1.0

Posted by David Jashi <da...@jashi.ge>.
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM <mb...@msn.com> wrote:
>
> hi
>
> try to activate the language-identifier plugin
> you must add it in the nutch-site.xml file in the  <name>plugin.includes</name> section.

Shame on me! Thanks a lot.

>
> it's some thing like that
>
>
>
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with the
>  underlying commons-httpclient library.
>  </description>
> </property>
>
>

RE: Multilanguage support in Nutch 1.0

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,
Do the pages carry any 'lang' metadata? I ask because the plugin first tries to get the language from metadata,
as you can see in the plugin's Java source, LanguageIndexingFilter.java:


    // check if LANGUAGE found, possibly put there by HTMLLanguageParser
    String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

    // check if HTTP-header tells us the language
    if (lang == null) {
        lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
    }

Also try using Luke to check all the metadata in your index.
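
The lookup order in the excerpt above can be tried in isolation. This is only a minimal sketch: plain maps stand in for the Nutch parse and content metadata, and the two key strings are hypothetical placeholders for Metadata.LANGUAGE and Response.CONTENT_LANGUAGE, not values taken from the Nutch source.

```java
import java.util.HashMap;
import java.util.Map;

class LanguageFallbackSketch {

    // Hypothetical stand-ins for the Nutch metadata keys used in the excerpt.
    static final String LANGUAGE = "language";
    static final String CONTENT_LANGUAGE = "Content-Language";

    /**
     * Returns the first language hint found: parse metadata first, then the
     * HTTP header. Returns null when neither is set, which is the case where
     * a guess from the page text itself would be needed.
     */
    static String resolveLanguage(Map<String, String> parseMeta,
                                  Map<String, String> contentMeta) {
        String lang = parseMeta.get(LANGUAGE);
        if (lang == null) {
            lang = contentMeta.get(CONTENT_LANGUAGE);
        }
        return lang;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put(CONTENT_LANGUAGE, "de");

        // Only the HTTP header is present, so it is used.
        System.out.println(resolveLanguage(parseMeta, contentMeta));

        // A value in the parse metadata takes precedence over the header.
        parseMeta.put(LANGUAGE, "ge");
        System.out.println(resolveLanguage(parseMeta, contentMeta));
    }
}
```

If both maps are empty the method returns null, so a page without either hint has no language at this point in the chain.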





 		 	   		  

Re: Multilanguage support in Nutch 1.0

Posted by David Jashi <da...@jashi.ge>.
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM <mb...@msn.com> wrote:
>
> hi
>
> try to activate the language-identifier plugin
> you must add it in the nutch-site.xml file in the  <name>plugin.includes</name> section.

Oops. It IS activated.

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.

RE: Multilanguage support in Nutch 1.0

Posted by BELLINI ADAM <mb...@msn.com>.
Hi,

Try activating the language-identifier plugin.
You must add it to the <name>plugin.includes</name> property in the nutch-site.xml file.

It looks something like this:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
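
Since the description says that any plugin not matching the expression is excluded, it can help to test plugin directory names against the value before crawling. The following is a rough sketch, assuming that a plugin is included only when its whole directory name matches the regular expression; the class and method names are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {

    // The plugin.includes value quoted in the property above.
    static final String INCLUDES =
        "protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)"
        + "|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
        + "|language-identifier";

    // Assumption: a plugin is included when its whole directory name matches.
    static boolean isIncluded(String pluginId, String includesRegex) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"language-identifier", "parse-html", "analysis-de", "analysis-ge"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id, INCLUDES));
        }

        // Hypothetical extension that would also match the two analysis plugins.
        String extended = INCLUDES + "|analysis-(de|ge)";
        System.out.println("analysis-de (extended) -> " + isIncluded("analysis-de", extended));
    }
}
```

Under that assumption, the sample value as written matches language-identifier and parse-html but neither analysis-de nor analysis-ge. The log in the original post shows both analysis plugins as registered, so the poster's own value presumably differs from this sample.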

