Posted to java-user@lucene.apache.org by Stephane Fellah <sf...@smartrealm.com> on 2011/03/11 04:29:30 UTC

Indexing of multilingual labels

I am trying to index in Lucene a field that could hold concept labels in
different languages. Most of the approaches I have seen so far are:

   - Use a single index, where each document has one field per language it
     uses, or

   - Use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called Payload that allows attaching attributes to
terms. Is anyone using this mechanism to store language (or other attributes
such as datatypes) information? Does this approach work if labels are the same
in different languages (does it break the inverted index)? How is performance
compared to the two other approaches? Any pointer to source code showing
how it is done would help.

Thanks

-- 
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515

Re: Indexing of multilingual labels

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Stephane,

I think that you have the freedom to put whatever you want in the stored value of a field.

The simplest would be to make the fields that you want to use for display stored (preformatted, XML-ished, OWL-ified, or JSON-ized) and keep them separate from the indexed fields (where you are only interested in the plain text).
Payloads seem to do a similar job to a separate stored, non-indexed field.

The best approach I have had so far is to use a multiplexing analyzer (which is called for indexed fields only anyway) that recognizes the language by the suffix of the field name.
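
To make the suffix idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name, the `label_<lang>` field naming, and the toy tokenizers are all assumptions made for illustration. In Lucene itself, a similar dispatch can be set up with `PerFieldAnalyzerWrapper` or a custom `Analyzer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Picks a per-language text pipeline from the field-name suffix (e.g. "label_fr"). */
class SuffixAnalyzerSketch {

    // Hypothetical stand-ins for real language analyzers: they only lowercase
    // with the language's locale and split on non-letter, non-digit characters.
    private static Function<String, List<String>> simple(Locale locale) {
        return text -> {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase(locale).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) tokens.add(t);
            }
            return tokens;
        };
    }

    // Assumed naming convention: the text after the last '_' is the language code.
    private static final Map<String, Function<String, List<String>>> BY_SUFFIX = Map.of(
            "en", simple(Locale.ENGLISH),
            "fr", simple(Locale.FRENCH),
            "de", simple(Locale.GERMAN));

    // Fallback for fields without a known language suffix: whitespace split only.
    private static final Function<String, List<String>> WHITESPACE = text -> {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    };

    /** Tokenizes text with the pipeline selected by the field name's suffix. */
    static List<String> analyze(String fieldName, String text) {
        int sep = fieldName.lastIndexOf('_');
        String suffix = sep >= 0 ? fieldName.substring(sep + 1) : "";
        return BY_SUFFIX.getOrDefault(suffix, WHITESPACE).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("label_fr", "Analyse de Fourier"));
        System.out.println(analyze("label", "Fourier series"));
    }
}
```

A real setup would replace the toy tokenizers with proper language analyzers (stemming, stop words); only the dispatch on the field-name suffix is the point here.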

As to the difference between one index with several fields and one field in many indices, I think it is just a programming difference. The tf and idf are always computed at the term level, so the choice makes no difference to them.

I tend to prefer multiple fields because it's easier to expand a query for, say, Fourier sent by a browser that says English but also accepts French and German into:
- a query for Fourier in the whitespace-tokenized track (always prefer that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields
My current experience is that many users appear or claim to speak many languages (they do, a little bit).
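
The expansion above could be sketched roughly as follows. This is plain Java with no Lucene dependency; the field names (`label_ws`, `label_<lang>`) and the class are hypothetical, the clauses are plain strings rather than real query objects, and the sketch only lowercases where a real system would apply each language's stemmer (so it yields `fourier`, not `fouri`, for French). It also ignores the `q` weights and keeps the header order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class LanguageQueryExpansionSketch {

    /** Extracts primary language codes from an Accept-Language header, in header order. */
    static Set<String> acceptedLanguages(String acceptLanguage) {
        Set<String> langs = new LinkedHashSet<>();
        for (String part : acceptLanguage.split(",")) {
            int semi = part.indexOf(';');                 // drop any ";q=..." weight
            String tag = (semi >= 0 ? part.substring(0, semi) : part).trim();
            if (tag.isEmpty() || tag.equals("*")) continue;
            int dash = tag.indexOf('-');                  // "en-US" -> "en"
            String primary = dash >= 0 ? tag.substring(0, dash) : tag;
            if (!primary.isEmpty()) langs.add(primary.toLowerCase(Locale.ROOT));
        }
        return langs;
    }

    /** Builds one clause for the whitespace field plus one per accepted language field. */
    static List<String> expand(String userQuery, String acceptLanguage) {
        List<String> clauses = new ArrayList<>();
        clauses.add("label_ws:" + userQuery);             // unstemmed track, as typed
        for (String lang : acceptedLanguages(acceptLanguage)) {
            // Placeholder for the language-specific analyzer: lowercasing only.
            clauses.add("label_" + lang + ":" + userQuery.toLowerCase(Locale.ROOT));
        }
        return clauses;
    }

    public static void main(String[] args) {
        expand("Fourier", "en-US, fr;q=0.8, de;q=0.5").forEach(System.out::println);
    }
}
```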

hope it helps.

paul

PS: not that my code is ideal but here are the ones I have:
 - i2geo, based on an ontology of concepts in OWL, 
	http://i2geo.net/xwiki/bin/view/About/GeoSkills
   and http://svn.activemath.org/intergeo/Platform/SearchI2G/
 - ActiveMath, fed by XML, http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html and 


On 11 March 2011 at 16:35, Stephane Fellah wrote:

> Erick,
> 
> I am trying to index multilingual taxonomies such as SKOS, WordNet, and
> EuroWordNet. Taxonomies are composed of concepts that have preferred and
> alternative labels in different languages. Some labels have the same lexical
> form in different languages. I want to index these concepts in
> Lucene so that I can search concepts by their label in one or
> several languages. I also want to be able to display a concept definition with
> all the alternative labels in different languages. My question is: could we
> use the payload mechanism to store the language assigned to the word (I read
> somewhere that Google was using payloads to store information such as font, for
> example, so why not language)? Wouldn't that be a better approach than using one
> field per language or one index per language?
> 
> Regards
> Stephane

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing of multilingual labels

Posted by Vinaya Kumar Thimmappa <vt...@ariba.com>.
Hello Stephane,

I think a better way is to keep the labels in per-language resource files
and store in the index a pointer to the correct resource entry
(something like the I18N and L10N approach). Store the internationalised
string in the index and all related localised strings in the resource files.

This way the index size will be reduced (adding payloads would have an impact
on performance), which helps performance too.

Your total search time would then be (search time + time to retrieve the
language-specific data).
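
A minimal sketch of this lookup, assuming the index stores only a language-neutral concept key and the labels live outside it. The class and method names are made up for illustration, and an in-memory map stands in for the resource files (in Java these could be `ResourceBundle` property files, one per locale).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Labels kept outside the index: concept key -> (language code -> label). */
class LabelResourcesSketch {

    private final Map<String, Map<String, String>> labels = new HashMap<>();

    /** Registers the label of a concept in one language. */
    void put(String conceptKey, String lang, String label) {
        labels.computeIfAbsent(conceptKey, k -> new HashMap<>()).put(lang, label);
    }

    /** Second step after the search: resolve the key found in the index to a label. */
    Optional<String> label(String conceptKey, String lang) {
        Map<String, String> byLang = labels.get(conceptKey);
        return byLang == null ? Optional.empty() : Optional.ofNullable(byLang.get(lang));
    }

    public static void main(String[] args) {
        LabelResourcesSketch resources = new LabelResourcesSketch();
        resources.put("concept:triangle", "en", "triangle");
        resources.put("concept:triangle", "de", "Dreieck");
        System.out.println(resources.label("concept:triangle", "de").orElse("(no label)"));
    }
}
```

Note that with this layout the labels themselves still need to be indexed somewhere if users are to search by label; the sketch only covers the display step.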

Hope this helps.
-Vinaya



Re: Indexing of multilingual labels

Posted by Stephane Fellah <sf...@smartrealm.com>.
Erick,

I am trying to index multilingual taxonomies such as SKOS, WordNet, and
EuroWordNet. Taxonomies are composed of concepts that have preferred and
alternative labels in different languages. Some labels have the same lexical
form in different languages. I want to index these concepts in
Lucene so that I can search concepts by their label in one or
several languages. I also want to be able to display a concept definition with
all the alternative labels in different languages. My question is: could we
use the payload mechanism to store the language assigned to the word (I read
somewhere that Google was using payloads to store information such as font, for
example, so why not language)? Wouldn't that be a better approach than using one
field per language or one index per language?
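
To illustrate the question, here is a toy model in plain Java of an inverted index whose postings carry a language tag. It is only a sketch of the idea, not Lucene's payload API (in Lucene a payload is a byte array attached to a term position); the class, its methods, and the concept ids are hypothetical. It shows that a label shared by two languages simply becomes one term with two postings, and that the tag can be used to filter at search time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LanguageTaggedIndexSketch {

    /** One posting: the concept a label belongs to, plus the label's language. */
    static final class Posting {
        final String conceptId;
        final String lang;

        Posting(String conceptId, String lang) {
            this.conceptId = conceptId;
            this.lang = lang;
        }
    }

    // Lowercased label -> postings; the same label in two languages shares one entry.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void addLabel(String conceptId, String lang, String label) {
        postings.computeIfAbsent(label.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                .add(new Posting(conceptId, lang));
    }

    /** Finds concepts by label; a null lang matches any language. */
    List<String> search(String label, String lang) {
        List<String> hits = new ArrayList<>();
        for (Posting p : postings.getOrDefault(label.toLowerCase(Locale.ROOT), List.of())) {
            if (lang == null || lang.equals(p.lang)) hits.add(p.conceptId);
        }
        return hits;
    }

    public static void main(String[] args) {
        LanguageTaggedIndexSketch index = new LanguageTaggedIndexSketch();
        index.addLabel("concept:present", "en", "gift");   // English "gift"
        index.addLabel("concept:poison", "de", "Gift");    // German "Gift" means poison
        System.out.println(index.search("gift", null));
        System.out.println(index.search("gift", "de"));
    }
}
```

The trade-off this makes visible: every posting for the term has to be visited and checked against its tag, whereas a field per language narrows the candidates before any posting is read.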

Regards
Stephane

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <er...@gmail.com> wrote:

> It's not so much a matter of problems with indexing/searching
> as it is with search behavior. The reason these strategies
> are implemented is that using English stemming, say, on
> other languages will produce "interesting" results.
>
> There's no a-priori reason you can't index multiple languages
> in the same field.
>
> So I don't see what you would accomplish by using payloads
> to indicate which language the term is in. Could you expand
> a bit on what you're trying to accomplish here? Maybe there
> are better solutions....
>
> Best
> Erick

Re: Indexing of multilingual labels

Posted by Erick Erickson <er...@gmail.com>.
It's not so much a matter of problems with indexing/searching
as it is with search behavior. The reason these strategies
are implemented is that using English stemming, say, on
other languages will produce "interesting" results.

There's no a-priori reason you can't index multiple languages
in the same field.

So I don't see what you would accomplish by using payloads
to indicate which language the term is in. Could you expand
a bit on what you're trying to accomplish here? Maybe there
are better solutions....

Best
Erick

