You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by aurora <au...@gmail.com> on 2005/01/20 21:08:09 UTC

Lucene and multiple languages

I'm trying to build some web search tool that could work for multiple  
languages. I understand that Lucene is shipped with StandardAnalyzer plus  
a German and Russian analyzers and some more in the sandbox. And that  
indexing and searching should use the same analyzer.

Now let's said I have an index with documents in multiple languages and  
analyzed by an assortment of analyzers. When user enter a query, what  
analyzer should be used? Should the user be asked for the language  
upfront? What to expect when the analyzer and the document doesn't match?  
Let's said the query is parsed using StandardAnalyzer. Would it match any  
documents done in German analyzer at all. Or would it end up in poor  
result?

Also is there a good way to find out the languages used in a web page?  
There is a 'content-langage' header in http and a 'lang' attribute in  
HTML. Looks like people don't really use them. How can we recognize the  
language?

Even more interesting is multiple languages used in one document, let's  
say half English and half French. Is there a good way to deal with those  
cases?

Thanks for any guidance.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene and multiple languages

Posted by Daniel Naber <da...@t-online.de>.

On Thursday 20 January 2005 21:08, aurora wrote:

> Now let's said I have an index with documents in multiple languages and
>   analyzed by an assortment of analyzers. When user enter a query, what
> analyzer should be used?

Use q1 OR q2, where q1 is the query parsed with the analyzer for language 
1, q2 is the query parsed with the analyzer for language 2 (and so on). If 
there are conflicts you could also add a required term query to each 
subquery, like "language:en^0" so that, for example, the English analyzer 
query only searches on documents that have been identified as English.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene and multiple languages

Posted by Ernesto De Santis <er...@colaborativa.net>.

I send you the source code in a private mail.

Ernesto.

aurora escribió:

> Thanks. I would like to give it a try. Is the source code available? 
> I'm  using a Python version of Lucene so it would need to be wrapped 
> or ported  :)
>
>> Hi Aurora
>>
>> I develop a tool with this multiple languages issue. I found very useful
>> an nuke library "language-identifier". This jar have nuke dependencies,
>> but I delete all unnecessary code (for me obvious).
>>
>> This language-identifier that I use work fine and is very simple:
>> For example:
>>
>> LanguageIdentifier languageIdentifier = 
>> LanguageIdentifier.getInstance();
>> String userInputText = "free text";
>> String language = languageIdentifier.identify(text);
>>
>> This work for 11 languages: English, Spanish, Portuguese, Dutch, German,
>> French, Italian, and Others.
>>
>> I can send you this touched jar, but remember that this jar is from
>> Nuke, for copyright (or left :).
>> http://www.nutch.org/LICENSE.txt
>>
>> More comments above...
>>
>> aurora escribió:
>>
>>> I'm trying to build some web search tool that could work for 
>>> multiple   languages. I understand that Lucene is shipped with 
>>> StandardAnalyzer  plus  a German and Russian analyzers and some more 
>>> in the sandbox. And  that  indexing and searching should use the 
>>> same analyzer.
>>>
>>> Now let's said I have an index with documents in multiple languages  
>>> and  analyzed by an assortment of analyzers. When user enter a 
>>> query,  what  analyzer should be used? Should the user be asked for 
>>> the  language  upfront? What to expect when the analyzer and the 
>>> document  doesn't match?  Let's said the query is parsed using 
>>> StandardAnalyzer.  Would it match any  documents done in German 
>>> analyzer at all. Or would  it end up in poor  result?
>>>
>> When this happen, in the major cases you do not obtain matchs.
>>
>>> Also is there a good way to find out the languages used in a web 
>>> page?   There is a 'content-langage' header in http and a 'lang' 
>>> attribute in   HTML. Looks like people don't really use them. How 
>>> can we recognize  the  language?
>>>
>> With language identifier. :)
>>
>>> Even more interesting is multiple languages used in one document,  
>>> let's  say half English and half French. Is there a good way to 
>>> deal  with those  cases?
>>>
>> Language identifier only return one language. I look into
>> language-identifier and work with a score for each language, and return
>> the language with greater value.
>> Maybe you can modify the language-identifier for take the most greater
>> values.
>>
>> Bye
>> Ernesto.
>>
>>> Thanks for any guidance.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene and multiple languages

Posted by aurora <au...@gmail.com>.

Thanks. I would like to give it a try. Is the source code available? I'm  
using a Python version of Lucene so it would need to be wrapped or ported  
:)

> Hi Aurora
>
> I develop a tool with this multiple languages issue. I found very useful
> an nuke library "language-identifier". This jar have nuke dependencies,
> but I delete all unnecessary code (for me obvious).
>
> This language-identifier that I use work fine and is very simple:
> For example:
>
> LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
> String userInputText = "free text";
> String language = languageIdentifier.identify(text);
>
> This work for 11 languages: English, Spanish, Portuguese, Dutch, German,
> French, Italian, and Others.
>
> I can send you this touched jar, but remember that this jar is from
> Nuke, for copyright (or left :).
> http://www.nutch.org/LICENSE.txt
>
> More comments above...
>
> aurora escribió:
>
>> I'm trying to build some web search tool that could work for multiple   
>> languages. I understand that Lucene is shipped with StandardAnalyzer  
>> plus  a German and Russian analyzers and some more in the sandbox. And  
>> that  indexing and searching should use the same analyzer.
>>
>> Now let's said I have an index with documents in multiple languages  
>> and  analyzed by an assortment of analyzers. When user enter a query,  
>> what  analyzer should be used? Should the user be asked for the  
>> language  upfront? What to expect when the analyzer and the document  
>> doesn't match?  Let's said the query is parsed using StandardAnalyzer.  
>> Would it match any  documents done in German analyzer at all. Or would  
>> it end up in poor  result?
>>
> When this happen, in the major cases you do not obtain matchs.
>
>> Also is there a good way to find out the languages used in a web page?   
>> There is a 'content-langage' header in http and a 'lang' attribute in   
>> HTML. Looks like people don't really use them. How can we recognize  
>> the  language?
>>
> With language identifier. :)
>
>> Even more interesting is multiple languages used in one document,  
>> let's  say half English and half French. Is there a good way to deal  
>> with those  cases?
>>
> Language identifier only return one language. I look into
> language-identifier and work with a score for each language, and return
> the language with greater value.
> Maybe you can modify the language-identifier for take the most greater
> values.
>
> Bye
> Ernesto.
>
>> Thanks for any guidance.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>



-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Lucene and multiple languages

Posted by Ernesto De Santis <er...@colaborativa.net>.

Hi Aurora

I develop a tool with this multiple languages issue. I found very useful
an nuke library "language-identifier". This jar have nuke dependencies,
but I delete all unnecessary code (for me obvious).

This language-identifier that I use work fine and is very simple:
For example:

LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(text);

This work for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and Others.

I can send you this touched jar, but remember that this jar is from
Nuke, for copyright (or left :).
http://www.nutch.org/LICENSE.txt

More comments above...

aurora escribió:

> I'm trying to build some web search tool that could work for multiple  
> languages. I understand that Lucene is shipped with StandardAnalyzer 
> plus  a German and Russian analyzers and some more in the sandbox. And 
> that  indexing and searching should use the same analyzer.
>
> Now let's said I have an index with documents in multiple languages 
> and  analyzed by an assortment of analyzers. When user enter a query, 
> what  analyzer should be used? Should the user be asked for the 
> language  upfront? What to expect when the analyzer and the document 
> doesn't match?  Let's said the query is parsed using StandardAnalyzer. 
> Would it match any  documents done in German analyzer at all. Or would 
> it end up in poor  result?
>
When this happen, in the major cases you do not obtain matchs.

> Also is there a good way to find out the languages used in a web 
> page?  There is a 'content-langage' header in http and a 'lang' 
> attribute in  HTML. Looks like people don't really use them. How can 
> we recognize the  language?
>
With language identifier. :)

> Even more interesting is multiple languages used in one document, 
> let's  say half English and half French. Is there a good way to deal 
> with those  cases?
>
Language identifier only return one language. I look into
language-identifier and work with a score for each language, and return
the language with greater value.
Maybe you can modify the language-identifier for take the most greater
values.

Bye
Ernesto.

> Thanks for any guidance.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org