Posted to java-user@lucene.apache.org by aurora <au...@gmail.com> on 2005/01/20 21:08:09 UTC
Lucene and multiple languages
I'm trying to build a web search tool that works for multiple
languages. I understand that Lucene ships with StandardAnalyzer plus
German and Russian analyzers, with some more in the sandbox, and that
indexing and searching should use the same analyzer.
Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When a user enters a query, which
analyzer should be used? Should the user be asked for the language
up front? What should I expect when the analyzer and the document don't
match? Let's say the query is parsed using StandardAnalyzer. Would it match
any documents indexed with the German analyzer at all, or would it just
give poor results?
Also, is there a good way to find out the languages used in a web page?
There is a 'Content-Language' header in HTTP and a 'lang' attribute in
HTML, but it looks like people don't really use them. How can we recognize
the language?
Even more interesting is multiple languages used in one document, let's
say half English and half French. Is there a good way to deal with those
cases?
Thanks for any guidance.
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Lucene and multiple languages
Posted by Daniel Naber <da...@t-online.de>.
On Thursday 20 January 2005 21:08, aurora wrote:
> Now let's say I have an index with documents in multiple languages,
> analyzed by an assortment of analyzers. When a user enters a query, which
> analyzer should be used?
Use q1 OR q2, where q1 is the query parsed with the analyzer for language
1, q2 is the query parsed with the analyzer for language 2 (and so on). If
there are conflicts you could also add a required term query to each
subquery, like "language:en^0" so that, for example, the English analyzer
query only searches on documents that have been identified as English.
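A minimal sketch of the "q1 OR q2" idea, using only the Java standard library so it runs without Lucene on the classpath. Everything in it is an assumption made for illustration: toyAnalyze is a hypothetical stand-in for the real per-language analyzers, the field name "contents" is invented, and the "language" field is taken from the example above. The sketch only shows the shape of the combined query as a query-syntax string; a real application would more likely combine the parsed subqueries with BooleanQuery and would have to escape special characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class MultiLanguageQuery {

    // Hypothetical stand-in for a real analyzer: lower-cases, splits on
    // whitespace, and strips one suffix so each "language" yields different terms.
    static List<String> toyAnalyze(String text, String suffix) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            boolean strip = token.length() > suffix.length() + 2 && token.endsWith(suffix);
            terms.add(strip ? token.substring(0, token.length() - suffix.length()) : token);
        }
        return terms;
    }

    // Language code -> analyzer, kept in insertion order.
    static Map<String, Function<String, List<String>>> defaultAnalyzers() {
        Map<String, Function<String, List<String>>> analyzers = new LinkedHashMap<>();
        analyzers.put("en", text -> toyAnalyze(text, "s"));
        analyzers.put("de", text -> toyAnalyze(text, "en"));
        return analyzers;
    }

    // One subquery per language: a required language filter with boost 0
    // plus the user's terms as analyzed for that language. Subqueries are
    // placed side by side, which the query syntax treats as OR by default.
    static String buildQuery(String userInput,
                             Map<String, Function<String, List<String>>> analyzers) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, Function<String, List<String>>> entry : analyzers.entrySet()) {
            List<String> terms = entry.getValue().apply(userInput);
            if (terms.isEmpty()) {
                continue;
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append("(+language:").append(entry.getKey()).append("^0 +(");
            for (int i = 0; i < terms.size(); i++) {
                if (i > 0) {
                    query.append(' ');
                }
                query.append("contents:").append(terms.get(i));
            }
            query.append("))");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("Houses kaufen", defaultAnalyzers()));
    }
}
```

The zero boost on the language clause keeps it acting as a filter only, so it restricts each subquery to its own language without contributing to the score.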
Regards
Daniel
--
http://www.danielnaber.de
Re: Lucene and multiple languages
Posted by Ernesto De Santis <er...@colaborativa.net>.
I sent you the source code in a private mail.
Ernesto.
Re: Lucene and multiple languages
Posted by aurora <au...@gmail.com>.
Thanks. I would like to give it a try. Is the source code available? I'm
using a Python version of Lucene so it would need to be wrapped or ported
:)
Re: Lucene and multiple languages
Posted by Ernesto De Santis <er...@colaborativa.net>.
Hi Aurora
I developed a tool that deals with this multiple-languages issue. I found
the Nutch library "language-identifier" very useful. The jar has Nutch
dependencies, but I deleted all the code that was unnecessary for my purposes.
The language-identifier that I use works fine and is very simple.
For example:
LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);
This works for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and others.
I can send you this modified jar, but remember that it comes from
Nutch, so mind the copyright (or copyleft :).
http://www.nutch.org/LICENSE.txt
More comments below...
aurora wrote:
> I'm trying to build a web search tool that works for multiple
> languages. I understand that Lucene ships with StandardAnalyzer
> plus German and Russian analyzers, with some more in the sandbox, and
> that indexing and searching should use the same analyzer.
>
> Now let's say I have an index with documents in multiple languages,
> analyzed by an assortment of analyzers. When a user enters a query,
> which analyzer should be used? Should the user be asked for the
> language up front? What should I expect when the analyzer and the
> document don't match? Let's say the query is parsed using StandardAnalyzer.
> Would it match any documents indexed with the German analyzer at all,
> or would it just give poor results?
>
When this happens, in most cases you get no matches.
> Also, is there a good way to find out the languages used in a web
> page? There is a 'Content-Language' header in HTTP and a 'lang'
> attribute in HTML, but it looks like people don't really use them. How
> can we recognize the language?
>
With the language identifier. :)
> Even more interesting is multiple languages used in one document,
> let's say half English and half French. Is there a good way to deal
> with those cases?
>
The language identifier returns only one language. I looked into
language-identifier: it computes a score for each language and returns
the language with the highest score.
Maybe you can modify language-identifier to return all of the
highest-scoring languages instead of just one.
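The idea of keeping several high-scoring languages instead of a single winner could look roughly like the sketch below. It is not the language-identifier code: the class TopLanguages, its tiny stopword profiles, and the threshold parameter are all hypothetical, and a real identifier would rely on much richer statistical profiles. The sketch only illustrates scoring every language and returning each one whose score clears a cutoff, so that a half-English, half-French page reports both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TopLanguages {

    // Hypothetical, deliberately tiny profiles: a few common words per language.
    static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "to")));
        PROFILES.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est", "les")));
        PROFILES.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "das")));
    }

    // Score per language = fraction of tokens found in that language's profile.
    static Map<String, Double> score(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int hits = 0;
            int total = 0;
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                total++;
                if (profile.getValue().contains(token)) {
                    hits++;
                }
            }
            scores.put(profile.getKey(), total == 0 ? 0.0 : (double) hits / total);
        }
        return scores;
    }

    // All languages whose score reaches the threshold, best first.
    static List<String> languagesAbove(String text, double threshold) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(score(text).entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            if (entry.getValue() >= threshold) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String mixed = "The cat is on the table and le chat est sur la table";
        System.out.println(languagesAbove(mixed, 0.2));
    }
}
```

Each returned language could then be stored in the document's language field, so that the document is reachable from the subquery of every language it contains.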
Bye
Ernesto.
> Thanks for any guidance.