Posted to users@solr.apache.org by Aleksandar Kanchev <am...@gmail.com> on 2021/10/08 13:17:18 UTC

Questions regarding - index and query content on different languages in one Solr field

Hi,

My name is Aleksandar Kanchev.

I am a web developer and have been using Apache Solr for ten years now.

We have been having issues handling multi-lingual content for quite some
time now and we are sure it's not an Apache Solr deficiency but a knowledge
deficiency on our side.


*What we are trying to do*
Index and query content in different languages (English, Persian, Chinese,
Japanese, etc.) in one Solr field.

*The problem*

Different language groups need different tokenizers and filters. We can't
apply multiple tokenizers to the same Solr field type.

*How we tried to solve the problem*

   - Create a separate Solr field with a specific field type (different
   tokenizer and filters configuration) for each particular language group.
   Example:
   <field name="content_cjk" type="text_cjk" multiValued="false"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   <field name="content_pe" type="text_pe" multiValued="false"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   ...


   - Then create a copy field.
   Example:
   <field name="content" type="text_general" multiValued="true"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   <copyField source="content_cjk" dest="content"/>
   <copyField source="content_pe" dest="content"/>
   ...


*The issue*
copyField copies the raw source value before any analysis runs, so the
destination field loses the tokenization and filtering of the
language-specific fields.

*Questions*

   - Can we use copy fields and preserve the generated tokens from the
   source fields?
   - Is there a better way to apply multiple language-specific tokenizers
   and filters on a single Solr field?


Thanks in advance!


Best,

Alex

Re: Questions regarding - index and query content on different languages in one Solr field

Posted by Nicolas Franck <Ni...@UGent.be>.
You're hitting a problem that we've all tried to solve: multilingual indexes.

You're forgetting one important point: when the user enters a query,
you have to choose which language field to search against, either
by language detection (out of band, by some other software?)
or by a select box in your GUI. You want to prevent Solr from
performing an incorrect field analysis.

Therefore, copying the language-specific field to another field after analysis
(even if that were possible) does not solve the problem, because the analysis
at query time needs to match the analysis at index time.
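
To make the index-time/query-time point concrete, here is a toy sketch in Python. It is not Solr code: the two functions are simplified stand-ins I made up for a CJK bigram analyzer and a whitespace-based general analyzer, and real analyzers do much more. It only shows that tokens produced by one analysis chain are not found by a query run through a different one.

```python
def whitespace_tokens(text):
    """Toy stand-in for a general-purpose analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def bigram_tokens(text):
    """Toy stand-in for a CJK analyzer: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matches(index_tokens, query_tokens):
    """A query matches only if every query token is present in the index."""
    indexed = set(index_tokens)
    return bool(query_tokens) and all(t in indexed for t in query_tokens)


# Index the text with the bigram analyzer.
indexed = bigram_tokens("東京都庁")

# Same analysis at query time: the query bigrams are found in the index.
same_analysis = matches(indexed, bigram_tokens("東京都"))

# Different analysis at query time: one whole-string token, never indexed.
mismatched_analysis = matches(indexed, whitespace_tokens("東京都"))

print(same_analysis, mismatched_analysis)
```

The same text is findable or not depending only on which analysis the query goes through, which is why the query has to be routed to the field whose analysis matches.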

I would focus on language detection of your input query,
and then search on that language-specific field.
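
A minimal sketch of that routing step, in Python. Assumptions: it guesses by Unicode script rather than doing real language detection (Arabic script alone cannot tell Persian from Arabic or Urdu, so a proper detector would be needed in practice), `content_en` is a hypothetical fallback field that does not appear in the schema above, and `pick_field` / `build_params` are names invented for this example. `q` and `df` are the standard Solr query and default-field parameters.

```python
def pick_field(query):
    """Guess the language-specific field from the scripts used in the query."""
    cjk = 0
    arabic_script = 0
    for ch in query:
        cp = ord(ch)
        # CJK Unified Ideographs, plus Hiragana and Katakana.
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF:
            cjk += 1
        # Arabic block, which also covers the Persian letters.
        elif 0x0600 <= cp <= 0x06FF:
            arabic_script += 1
    if cjk and cjk >= arabic_script:
        return "content_cjk"
    if arabic_script:
        return "content_pe"
    return "content_en"  # hypothetical fallback field, not in the original schema


def build_params(query):
    """Build request parameters that search only the chosen field."""
    return {"q": query, "df": pick_field(query)}


print(build_params("東京都"))
print(build_params("solr"))
```

The resulting dictionary can be passed as the query parameters of a request to the select handler; a select box in the GUI would simply replace `pick_field` with the user's explicit choice.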
