Posted to solr-user@lucene.apache.org by Ilia Sretenskii <sr...@multivi.ru> on 2014/09/05 16:06:04 UTC

How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words whose parts come from different
languages, and search queries of the same complexity. It is a worldwide-used
online application, so users generate content in all possible world
languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.

But then it requires stemming and lemmatization.

How can we implement a schema with universal stemming/lemmatization, one
which would probably utilize the ICU-generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
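
For reference, the kind of single universal field I have in mind is roughly
this (just a sketch, not our actual schema; the field and type names are
placeholders):

<fieldType name="text_universal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- script-aware Unicode tokenization -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case/diacritics folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- what to put here for universal stemming/lemmatization is the open question -->
  </analyzer>
</fieldType>
<field name="content" type="text_universal" indexed="true" stored="true" multiValued="true"/>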

By the way, I have already examined the schema of the Basistech commercial
plugins, and it defines the tokenizer/filter language per field type,
which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.

Re: How to implement multilingual word components fields schema?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Ilia,

I see that Trey answered your question about how you might stack
language specific filters in one field.  If I remember correctly, his
approach assumes you have identified the language of the query.  That
is not the same as detecting the script of the query and is much
harder.

Trying to do language-specific processing on multiple languages,
especially a large number such as the 200 you mention or the 400 in
HathiTrust, is a very difficult problem.  Detecting language (rather
than script) in short queries is an open problem in the research
literature.  As others have suggested, you might want to start with
something less ambitious that meets most of your business needs.

You also might want to consider whether the errors a stemmer might
make on some queries will be worth the increase in recall that you
will get on others. Concern about getting results that can confuse
users is one of the main reasons we haven't seriously pursued
stemming in HathiTrust full-text search.

Regarding the papers listed in my previous e-mail, you can get the
first paper at the link I gave; the second paper (although on
re-reading it, I don't think it will be very useful) is available if
you go to the link for the code and follow the link on that page to
the paper.

I suspect you might want to think about the differences between
scripts and  languages.  Most of the Solr/Lucene stemmers either
assume you are only giving them the language they are designed for, or
work on the basis of script.  This works well when there is only one
language per script, but breaks if you have many languages using the
same script such as the Latin-1 languages.

(See this gist for all the URLs below with context:
https://gist.github.com/anonymous/2e1233d80f37354001a3)

That PolyGlotStemming filter uses the ICUTokenizer's script
identification, but there are at least 12 different languages that use
the Arabic script (http://www.omniglot.com/writing/arabic.htm) and over
100 that use Latin-1.  Please see the list of languages and scripts at
http://aspell.net/man-html/Supported.html#Supported or
http://www.omniglot.com/writing/langalph.htm#latin
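
If you do experiment with that kind of script dispatch, a minimal sketch of
such a filter might look roughly like the following (a hypothetical,
untested class, not the filter above; it only stems scripts that map fairly
cleanly to a single language and passes everything else, including Latin
script, through unchanged):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.ArmenianStemmer;
import org.tartarus.snowball.ext.RussianStemmer;
import com.ibm.icu.lang.UScript;

/** Sketch: choose a stemmer from the script code ICUTokenizer attached to each token. */
public final class ScriptDispatchStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final SnowballProgram armenian = new ArmenianStemmer();
  private final SnowballProgram russian = new RussianStemmer(); // assumption: Cyrillic ~ Russian

  public ScriptDispatchStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final SnowballProgram stemmer;
    switch (scriptAtt.getCode()) {
      case UScript.ARMENIAN:  stemmer = armenian; break;
      case UScript.CYRILLIC:  stemmer = russian;  break;
      default:                return true; // Latin, Arabic, etc.: too many languages per script
    }
    stemmer.setCurrent(termAtt.buffer(), termAtt.length());
    if (stemmer.stem()) {
      termAtt.copyBuffer(stemmer.getCurrentBuffer(), 0, stemmer.getCurrentBufferLength());
    }
    return true;
  }
}

The Cyrillic-means-Russian mapping there is exactly the kind of guess that
breaks down, which is the point of the language lists above.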

As a simple example, German and English both use the Latin-1 character
set.  Using an English stemmer for German or a German stemmer for
English is unlikely to work very well.  If you try to use stop words
for multiple languages you will run into difficulties where a stop
word in one language is a content word in another.  For example  if
you use German stop words such as "die", you will eliminate the
English content word "die".

Identifying languages in short texts such as queries is a hard
problem. About half the papers looking at query language
identification cheat, and look at things such as the language of the
pages that a user has clicked on.  If all you have to go on is the
text of the query, language identification is very difficult.  I
suspect that mixed-script queries are even harder (see
http://www.transacl.org/wp-content/uploads/2014/02/38.pdf).

See the papers by Marco Lui and Tim Baldwin on Marco's web page:
http://ww2.cs.mu.oz.au/~mlui/
In the paper "Language Identification: the Long and the Short of the
Matter" (http://www.aclweb.org/anthology/N10-1027) they explain why
short-text language identification is a hard problem.

Other papers available on Marco's page describe the design and
implementation of langid.py which is a state-of-the-art language
identification program.

 I've tried several  language guessers  designed for short texts and
at least on queries from our query logs,  the results were unusable.
Both langid.py  with the defaults (noEnglish.langid.gz  pipe
delimited) and ldig with the most recent latin.model
(NonEnglish.ldig.gz tab delimited) did not work well for our queries.

However, both of these have parameters that can be tweaked and also
facilities for training if you have labeled data.

ldig is specifically designed to run on short text like queries or twitter.
It can be configured to spit out the scores for each language instead
of only the highest score (default).  Also we didn't try to limit the
list of languages it looks for, and that might give better results.

https://github.com/shuyo/ldig
langdetect looks like it's by the same programmer and is in Java, but I
haven't tried it:
http://code.google.com/p/language-detection/
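
From its documentation, the basic usage looks roughly like this (method
names taken from the project's wiki, so treat this as an unverified sketch):

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LangDetectSketch {
  public static void main(String[] args) throws LangDetectException {
    // "profiles" is the directory of per-language profiles shipped with the library
    DetectorFactory.loadProfile("profiles");
    Detector detector = DetectorFactory.create();
    detector.append("Løgismose-alike search text");
    String lang = detector.detect(); // a single best guess, e.g. "da" or "en"
    System.out.println(lang);
  }
}

Like the others, it is trained on longer text, so I would expect the same
short-query problems.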

langid is designed by linguistic experts, but may need to be trained
on short queries.
https://github.com/saffsd/langid.py

There is also Mike McCandless' port of the Google CLD

http://blog.mikemccandless.com/2013/08/a-new-version-of-compact-language.html
http://code.google.com/p/chromium-compact-language-detector/source/browse/README
However, here is this comment:
"Indeed I see the same results as you; I think CLD2 is just not
designed for short text."
and a similar comment was made in this talk:
http://videolectures.net/russir2012_grigoriev_language/


If you aren't worried about false drops, and your documents are
relatively short, and your use case favors recall over precision, you
might want to look at McNamee and Mayfield's work on
language-independent stemming.  I don't know if their n-gram approach
would be feasible for your use case, but they also got good results
on TREC/CLEF newswire article datasets with just truncating words.
We can't use their approach because we already have a high-recall/
low-precision situation and because our documents are several orders of
magnitude larger than the TREC/CLEF/FIRE newswire articles they tested
with.
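
For what it's worth, plain truncation is at least cheap to experiment with
in Solr; a sketch, assuming your version ships solr.TruncateTokenFilterFactory
(the prefix length here is just a guess, not something we have evaluated):

<fieldType name="text_trunc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- language-independent "stemming": keep only the first N characters of each token -->
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
  </analyzer>
</fieldType>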

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing
morphological variation in alphabetic languages. In Proceedings of the
32nd international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '09). ACM, New York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957


Paul McNamee, Charles Nicholas, and James Mayfield. 2008. Don't have a
stemmer?: be un+concern+ed. In Proceedings of the 31st annual
international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '08). ACM, New York, NY, USA, 813-814.
DOI=10.1145/1390334.1390518 http://doi.acm.org/10.1145/1390334.1390518

I hope this helps.

Tom


Re: How to implement multilingual word components fields schema?

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Ilia,

one aspect you surely lose with a single-field approach is the differentiation of semantic fields in different languages for words that sound the same.
The words "sitting" and "directions" are easy examples that have fully different semantics in French and English, at least.
"directions" would appear common with, say, teacher advice in English but not in French.

I disagree that storage should be an issue in your case… most Solr installations do not suffer from that, as far as I can tell from the list. Generally, you do not need all these stemmed fields to be stored; they are just indexed, and the index itself takes very little storage.

Using separate fields also has advantages in terms of IDF, I think.

I do not understand the last question to Tom, he provides URLs to at least one of the papers.

Also, if you can put a hand on it, the book of Peters, Braschler, and Clough is probably relevant: http://link.springer.com/book/10.1007%2F978-3-642-23008-0 but, as the first article referenced by Tom says, the CLIR approach here relies on parallel corpora, e.g. created by automatic translations.


Paul






Re: How to implement multilingual word components fields schema?

Posted by Trey Grainger <so...@gmail.com>.
Hi Ilia,

When writing *Solr in Action*, I implemented a feature which can do what
you're asking (allow multiple, dynamic analyzers to be used in a single
text field). This would allow you to use the same field and dynamically
change the analyzers (for example, you could do language identification on
documents and only stem for the identified languages). It also supports more
than one Analyzer per field (i.e. if you have single documents or queries
containing multiple languages).

This seems to be a feature request which comes up regularly, so I just
submitted a new feature request on JIRA to add this feature to Solr and
track the progress:
https://issues.apache.org/jira/browse/SOLR-6492

I included a comment showing how to use the functionality currently
described in *Solr in Action*, but I plan to make it easier to use over the
next 2 months before calling it done. I'm going to be talking about
multilingual search in November at Lucene/Solr Revolution, so I'd ideally
like to finish before then so I can demonstrate it there.

Thanks,

-Trey Grainger
Director of Engineering, Search & Analytics @ CareerBuilder



Re: How to implement multilingual word components fields schema?

Posted by Jorge Luis Betancourt Gonzalez <jl...@uci.cu>.
In one of the talks by Trey Grainger (author of Solr in Action) he touches on how CareerBuilder is dealing with multilingual content using payloads; it's a little more work but I think it would pay off.


Concurso "Mi selfie por los 5". Detalles en http://justiciaparaloscinco.wordpress.com

Re: How to implement multilingual word components fields schema?

Posted by Jack Krupansky <ja...@basetechnology.com>.
You also need to take a stance as to whether you wish to auto-detect the
language at query time vs. have a UI selection of language vs. attempt to
perform the same query for each available language and then "determine"
which has the best "relevancy". The latter two options are very sensitive to
short queries. Keep in mind that auto-detection for indexing full documents
is a different problem than auto-detection for very short queries.
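
For the document side, Solr's contrib language identification update
processors can handle index-time detection; a rough sketch for solrconfig.xml
(field names are placeholders, and this does nothing for short queries):

<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,body</str>        <!-- fields to examine -->
    <str name="langid.langField">language_s</str> <!-- where the detected code goes -->
    <str name="langid.fallback">en</str>          <!-- used when detection is not confident -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>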

-- Jack Krupansky



Re: How to implement multilingual word components fields schema?

Posted by Ilia Sretenskii <sr...@multivi.ru>.
Thank you for the replies, guys!

Using a field-per-language approach for multilingual content is the last
thing I would try, since my actual task is to implement search
functionality which offers roughly the same capabilities for
every known world language.
The closest references are the popular web search engines; they seem to
serve worldwide users with their different languages and even
cross-language queries as well.
Thus, a field-per-language approach would be a sure waste of storage
resources due to the high number of duplicates, since there are over 200
known languages.
I really would like to keep a single field for cross-language searchable text
content, without splitting it into language-specific fields or
language-specific cores.

So my current choice is to stay with just the ICUTokenizer and
ICUFoldingFilter as they are, without any language-specific
stemmers/lemmatizers at all for now.

Probably I will put stop word filters and stemmers for the most popular
languages into the same searchable text field, to give it a try and see
if it works correctly in a stack.
Does stacking language-specific filters work correctly in one field?
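
For clarity, the kind of stacking I mean is roughly this (a sketch with only
two languages shown, not a tested schema):

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- stop word filters for a few popular languages, chained in one analyzer -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" ignoreCase="true"/>
    <!-- stemmers applied one after another, regardless of the token's language -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Whether chaining stemmers like this behaves sensibly is exactly what I am
asking about.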

Further development will most likely involve some advanced custom analyzers
like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
ScriptAttribute.
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java

So I would like to know more about those "academic papers on this issue of
how best to deal with mixed language/mixed script queries and documents".
Tom, could you please share them?

Re: How to implement multilingual word components fields schema?

Posted by Tom Burton-West <tb...@umich.edu>.
Hi Ilia,

I don't know if it would be helpful, but below I've listed some academic
papers on this issue of how best to deal with mixed language/mixed script
queries and documents.  They are probably taking a more complex approach
than you will want to use, but perhaps they will help you think about the
various ways of approaching the problem.

We haven't tackled this problem yet. We have over 200 languages.  Currently
we are using the ICUTokenizer and ICUFoldingFilter but don't do any
stemming, due to a concern with overstemming (we have very high recall, so
don't want to hurt precision by stemming) and the difficulty of correct
language identification of short queries.

If you have languages where there is only one language per script, however,
you might be able to do much more.  I'm not sure if I'm remembering
correctly, but I believe some of the stemmers, such as the Greek stemmer,
will pass through any strings that don't contain characters in the Greek
script.  So it might be possible to at least do stemming for some of your
languages/scripts.

 I'll be very interested to learn what approach you end up using.

Tom

------

Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and
weighting of multilingual and mixed documents. In *Proceedings of the South
African Institute of Computer Scientists and Information Technologists
Conference on Knowledge, Innovation and Leadership in a Diverse,
Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA,
161-170. DOI=10.1145/2072221.2072240
http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here:
http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo
Rosso. 2014. Query expansion for mixed-script information retrieval.
In *Proceedings
of the 37th international ACM SIGIR conference on Research & development in
information retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686.
DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622

Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburtonw@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


>

RE: How to implement multilingual word components fields schema?

Posted by Susheel Kumar <su...@thedigitalgroup.net>.
Agree with the approach Jack suggested to use the same source text in multiple fields for each language and then doing a dismax query.  Would love to hear if it works for you.

Thanks,
Susheel



Re: How to implement multilingual word components fields schema?

Posted by Jack Krupansky <ja...@basetechnology.com>.
It comes down to how you personally want to value compromises between 
conflicting requirements, such as relative weighting of false positives and 
false negatives. Provide a few use cases that illustrate the boundary cases 
that you care most about. For example field values that have snippets in one 
language embedded within larger values in a different language. And, whether 
your fields are always long or sometimes short - the former can work well 
for language detection, but not the latter, unless all fields of a given 
document are always in the same language.

Otherwise simply index the same source text in multiple fields, one for each 
language. You can then do a dismax query on that set of fields.
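
A rough sketch of what that could look like (field and type names are
placeholders; text_en/text_de/text_ja are the stock language-specific types
from the example schema):

<field name="text_src" type="string"  indexed="false" stored="true"/>
<field name="text_en"  type="text_en" indexed="true"  stored="false"/>
<field name="text_de"  type="text_de" indexed="true"  stored="false"/>
<field name="text_ja"  type="text_ja" indexed="true"  stored="false"/>
<copyField source="text_src" dest="text_en"/>
<copyField source="text_src" dest="text_de"/>
<copyField source="text_src" dest="text_ja"/>

Then query across the set, e.g.:

q=Løgismose&defType=edismax&qf=text_en text_de text_ja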

-- Jack Krupansky
