Posted to user@nutch.apache.org by Laurent Michenaud <lm...@adeuza.fr> on 2006/03/03 15:14:19 UTC

nutch and multilingualism

Hi,

 

What is a good strategy to adopt for multilingual sites?

I want Nutch to index a site in its different languages, and then have the search only return results that are in the user's language.

Thanks for any advice.

Re: nutch and multilingualism

Posted by Zaheed Haque <za...@gmail.com>.
I think it's a very good idea. It would be even better if one could
create a separate crawl script just for n-gram creation, where one could
add their own URLs, for example national libraries' URLs. My
thinking is:

bin/nutch ngram

which is similar to the one-shot intranet crawl, but only for
n-gram creation; instead of using crawl-urlfilter we would use
crawl-ngram or something.

just my two cents :-)

Cheers

On 3/6/06, Ivan Sekulovic <se...@net.yu> wrote:
> Hi Jerome!
>
> Would it be possible to generate ngram profiles for LanguageIdentifier
> plugin from crawled content and not from file? What is my idea? The best
> source for content in one language could be wikipedia.org.  We would
> just crawl the wikipedia in desired language and then create ngram
> profile from it. What are your thoughts about this idea?
>
> Best Regards,
> Ivan
>
>
>
> Jérôme Charron wrote:
>
> >>What is the good strategy to adopt for multilingualism sites ?
> >>
> >>
> >
> >I want nutch to index a site in the different languages and
> >
> >
> >>then, the search only prints results that are in the user language.
> >>
> >>
> >
> >Hi Laurent,
> >
> >What I can suggest is to :
> >1. use the languageidentifier plugin while crawling in order to guess the
> >language of the content
> >2. automatically filters the results by adding the lang:<user_agent_lang>
> >clause to the query (could be done in the jsp).
> >
> >Jérôme
> >
> >--
> >http://motrech.free.fr/
> >http://www.frutch.org/
> >
> >
> >
>
>
>


--
Best Regards
Zaheed Haque
Phone : +46 735 000006
E.mail: zaheed.haque@gmail.com

Re: nutch and multilingualism

Posted by Wray Buntine <bu...@hiit.fi>.
Ivan Sekulovic wrote:

> Jérôme Charron wrote:
>
>>> Would it be possible to generate ngram profiles for LanguageIdentifier
>>> plugin from crawled content and not from file? What is my idea? The 
>>> best
>>> source for content in one language could be wikipedia.org.  We would
>>> just crawl the wikipedia in desired language and then create ngram
>>> profile from it. What are your thoughts about this idea?
>>>   
>>
>>
>> I think it could be a good idea.
>> Wikipedia could be a good source (not sure the best one).
>> But instead of crawling wikipedia, it would probably be easier to 
>> download a
>> wikipedia dump
>> (http://download.wikimedia.org/)  and then extracts its textual 
>> content to a
>> file... no?
>
> I agree for wikipedia. But because nutch is content fetching tool it 
> would be useful to have some sort of tool to use that content for 
> creating ngram profiles. It seems natural. Maybe it would be possible 
> to create some sort of export in plain text of indexed content..

We use this content on a regular basis. The short story is: grab their
MediaWiki dump and write a simple text extractor, which will probably need
to be modified somewhat regularly. We have a more complex structured-text
extractor in Perl that we use because we want more of the structure retained.

Some issues:

1)  It is one of the largest and most varied language collections available.
2)  It is not typical text: not conversational, less commercial, and with
     lots of bizarre stuff that could confuse n-gram analysis (e.g.,
     character tables).
3)  The standard dump is in MediaWiki format, which changes a bit almost
     every month. They do supply some tools for conversion, e.g., a PHP
     script for conversion to HTML, but at any point in time these are
     usually broken on many pages.
4)  The alternative is just to run the MediaWiki app and crawl in situ.
     This takes almost a week on a single mediocre CPU box because MediaWiki
     conversion is s-l-o-w. So I recommend against this option.
5)  We maintain a Perl MediaWiki-to-poor-man's-HTML converter that we use
     to retain broad HTML structure. If you expect to just extract text and
     retain proper sentence and word boundaries, your task will be easy.

Wray Buntine

Re: nutch and multilingualism

Posted by Ivan Sekulovic <se...@net.yu>.
Jérôme Charron wrote:

>>Would it be possible to generate ngram profiles for LanguageIdentifier
>>plugin from crawled content and not from file? What is my idea? The best
>>source for content in one language could be wikipedia.org.  We would
>>just crawl the wikipedia in desired language and then create ngram
>>profile from it. What are your thoughts about this idea?
>>    
>>
>
>I think it could be a good idea.
>Wikipedia could be a good source (not sure the best one).
>But instead of crawling wikipedia, it would probably be easier to download a
>wikipedia dump
>(http://download.wikimedia.org/)  and then extracts its textual content to a
>file... no?
>
>  
>
I agree about Wikipedia. But because Nutch is a content-fetching tool, it
would be useful to have some sort of tool that uses that content for
creating n-gram profiles. It seems natural. Maybe it would be possible to
create some sort of plain-text export of the indexed content.
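The profile-building step itself amounts to counting overlapping character n-grams in the extracted text. A minimal sketch (an illustration only, not the actual LanguageIdentifier plugin code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative n-gram counter; a real profile would normalize counts
// and keep only the most frequent n-grams per language.
public class NgramProfile {
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // slide a window of length n over the text, counting each n-gram
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }
}
```

Language identification then compares a document's n-gram frequencies against each stored language profile and picks the closest match.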

Sekula

Re: nutch and multilingualism

Posted by Jérôme Charron <je...@gmail.com>.
> Would it be possible to generate ngram profiles for LanguageIdentifier
> plugin from crawled content and not from file? What is my idea? The best
> source for content in one language could be wikipedia.org.  We would
> just crawl the wikipedia in desired language and then create ngram
> profile from it. What are your thoughts about this idea?

I think it could be a good idea.
Wikipedia could be a good source (not sure it is the best one).
But instead of crawling Wikipedia, it would probably be easier to download a
Wikipedia dump (http://download.wikimedia.org/) and then extract its textual
content to a file... no?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: nutch and multilingualism

Posted by Ivan Sekulovic <se...@net.yu>.
Hi Jerome!

Would it be possible to generate n-gram profiles for the LanguageIdentifier
plugin from crawled content and not from a file? Here is my idea: the best
source for content in one language could be wikipedia.org. We would
just crawl the Wikipedia in the desired language and then create an n-gram
profile from it. What are your thoughts on this idea?

Best Regards,
Ivan



Jérôme Charron wrote:

>>What is the good strategy to adopt for multilingualism sites ?
>>    
>>
>
>I want nutch to index a site in the different languages and
>  
>
>>then, the search only prints results that are in the user language.
>>    
>>
>
>Hi Laurent,
>
>What I can suggest is to :
>1. use the languageidentifier plugin while crawling in order to guess the
>language of the content
>2. automatically filters the results by adding the lang:<user_agent_lang>
>clause to the query (could be done in the jsp).
>
>Jérôme
>
>--
>http://motrech.free.fr/
>http://www.frutch.org/
>
>  
>
>  
>


Re: nutch and multilingualism

Posted by Jérôme Charron <je...@gmail.com>.
> What is the good strategy to adopt for multilingualism sites ?
> I want nutch to index a site in the different languages and
> then, the search only prints results that are in the user language.

Hi Laurent,

What I can suggest is to:
1. use the languageidentifier plugin while crawling in order to guess the
language of the content
2. automatically filter the results by adding a lang:<user_agent_lang>
clause to the query (this could be done in the JSP).
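Step 2 could be sketched as follows. The class and method names are illustrative (not Nutch API); the idea is just to take the user's preferred language from the Accept-Language request header and append a lang: clause to the query string:

```java
// Hypothetical helper for the JSP: narrow results to the user's language.
public class LangFilter {
    static String addLangClause(String query, String acceptLanguage) {
        // Take the primary subtag of the first entry,
        // e.g. "fr-FR,fr;q=0.9" -> "fr"
        String lang = acceptLanguage.split("[,;-]")[0].trim().toLowerCase();
        return query + " lang:" + lang;
    }
}
```

In a JSP this would be driven by request.getHeader("Accept-Language"); you would probably also want to let the user override the guessed language explicitly.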

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/