Posted to dev@lucene.apache.org by "Kuntal Ganguly (JIRA)" <ji...@apache.org> on 2015/05/07 15:28:59 UTC

[jira] [Created] (SOLR-7509) Solr Multilingual Indexing with one field

Kuntal Ganguly created SOLR-7509:
------------------------------------

             Summary: Solr Multilingual Indexing with one field
                 Key: SOLR-7509
                 URL: https://issues.apache.org/jira/browse/SOLR-7509
             Project: Solr
          Issue Type: Wish
          Components: Schema and Analysis
    Affects Versions: 4.2.1
         Environment: Redhat Linux, 4 core, 12 GB
            Reporter: Kuntal Ganguly


Our production index is currently 1.5 TB across 3 shards, and it uses the following field type:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
    </analyzer>
</fieldType>

This field type works well for our US, English-language clients.
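
To make the matching behaviour of this field type concrete, here is a rough Python sketch of the two analyzer chains. It is only an approximation: CustomNGramFilterFactory is our own class, and the sketch assumes it behaves like a standard n-gram filter that also keeps the original token when preserveOriginal="true". The function names are made up for illustration.

```python
def index_terms(value, min_gram=3, max_gram=30, preserve_original=True):
    """Approximate the index-time chain: keyword token -> lowercase -> n-grams."""
    # KeywordTokenizer emits the whole field value as a single token.
    token = value.lower()
    terms = set()
    # Emit every substring whose length is between min_gram and max_gram.
    for size in range(min_gram, max_gram + 1):
        for start in range(0, len(token) - size + 1):
            terms.add(token[start:start + size])
    # Assumption: preserveOriginal also indexes the full token, even when it
    # is shorter than min_gram or longer than max_gram.
    if preserve_original:
        terms.add(token)
    return terms


def query_term(value):
    """Approximate the query-time chain: keyword token -> lowercase, no n-grams."""
    return value.lower()


def matches(field_value, query):
    """A query matches when its single term is one of the indexed terms."""
    return query_term(query) in index_terms(field_value)


print(sorted(index_terms("Solr")))        # ['olr', 'sol', 'solr']
print(matches("Apache Solr", "che so"))   # True: a substring of 3 to 30 characters
print(matches("Apache Solr", "so"))       # False: shorter than minGramSize
```

Under these assumptions the field gives case-insensitive substring matching for queries of 3 to 30 characters, plus exact matching on the whole value.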

We now have some new Chinese and Japanese clients. While researching the best approach to multilingual indexing, I found these two articles:

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/

https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

Every approach they describe seems to have its own pros and cons.

I then experimented with a single-field approach. Here is my new field type:

<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory"/>
        <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
    </analyzer>
</fieldType>

I kept the same tokenizer and changed only the filters. It works well for all existing searches and use cases on English documents, as well as for the new use cases on Chinese and Japanese documents.
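
As a small illustration of the normalization the new filters add in front of the rest of the chain, the Python sketch below folds character width and then lowercases. It is an approximation only: Unicode NFKC normalization is used as a stand-in for CJKWidthFilterFactory (NFKC folds more than just width variants), CJKBigramFilterFactory is not modelled at all, and the function names are hypothetical.

```python
import unicodedata


def fold_width(text):
    """Stand-in for CJKWidthFilter: fullwidth Latin -> ASCII,
    halfwidth katakana -> regular katakana.
    Assumption: NFKC is a superset of what the real filter does."""
    return unicodedata.normalize("NFKC", text)


def normalize(value):
    """Width folding followed by lowercasing, in the same order as the chain."""
    return fold_width(value).lower()


print(normalize("ＳＯＬＲ"))  # solr
print(normalize("ｶﾀｶﾅ"))      # カタカナ
```

With this step in place, a query typed with fullwidth Latin letters or halfwidth katakana reduces to the same string as its ordinary-width form before any further analysis happens.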

I have the following questions for the Solr experts and developers:

1) Is this a correct approach, or am I missing something?

2) Can you give me an example where this new field type would cause problems? A concrete use case or scenario would be very helpful.

3) Is this approach likely to cause problems in the future as different clients come on board?

Please provide some guidance.


