Posted to issues@lucene.apache.org by "Vlad (Jira)" <ji...@apache.org> on 2020/09/04 21:25:00 UTC

[jira] [Updated] (SOLR-14832) Inversion English and numbers characters in Arabic documents

     [ https://issues.apache.org/jira/browse/SOLR-14832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vlad updated SOLR-14832:
------------------------
    Description: 
Hi Support,

Please help me resolve an issue. I upload and index several documents in English and in Arabic to Solr, and I use the following field type with Arabic analysis filters:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="50">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
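
To see what this analyzer chain does to a sample value, one option is Solr's field analysis request handler. The sketch below only builds the request URL. It assumes the handler is registered at /analysis/field in solrconfig.xml, and the host, port, and core name are hypothetical placeholders.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, sample_text):
    """Build a URL for Solr's field analysis request handler.

    Assumption: the handler is registered at /analysis/field.
    """
    params = urlencode({
        "analysis.fieldtype": field_type,    # the fieldType name, here "text"
        "analysis.fieldvalue": sample_text,  # text run through the index analyzer
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{params}"


# Hypothetical host and core name; replace them with the real ones.
print(field_analysis_url("http://localhost:8983/solr", "collection1",
                         "text", "www.apache.org"))
```

Opening the printed URL in both environments and comparing the token output per filter stage would show whether the two analyzer chains behave differently for the same input.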

 

There are two environments:
 # Local machine:
     - Solr version: 4.2
     - Windows version: 10
 # DEV env:
     - Solr version: 4.1, as part of the Cloudera suite
     - Linux kernel version: 3.10.0-862

 

The issue appears when uploading documents:
 # Local machine:
     - Document in English with English words only: OK (for example, "[www.apache.org|http://www.apache.org/]")
     - Document in Arabic with some English words: OK (for example, "[www.apache.org|http://www.apache.org/]")
 # DEV env:
     - Document in English with English words only: OK (for example, "[www.apache.org|http://www.apache.org/]")
     - Document in Arabic with some English words: the English text is reversed (for example, "gro.ehcapa.www"), which makes keyword search impossible.
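
A quick way to confirm this symptom on text taken from either environment is to check whether expected keywords occur only in reversed form. This is a minimal sketch: `find_reversed_terms` is a hypothetical helper, not part of Solr, and it does not indicate where in the pipeline the reversal happens.

```python
def find_reversed_terms(text, expected_terms):
    """Return the expected terms that appear in `text` only character-reversed."""
    haystack = text.lower()
    hits = []
    for term in expected_terms:
        needle = term.lower()
        # Flag a term only if its forward form is absent and its reversal is present.
        if needle not in haystack and needle[::-1] in haystack:
            hits.append(term)
    return hits


# Illustrative input: an Arabic word followed by a reversed Latin run.
sample = "مرحبا " + "www.apache.org"[::-1]
print(find_reversed_terms(sample, ["www.apache.org"]))
```

Running the same check on the stored field values from the local machine and from DEV would show which environment returns the reversed form for the same source document.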

 

Please advise whether this is fixable and, if so, how.

 

Thank you in advance!

  was:
Hi Support,

Please help me resolve an issue. I upload and index several documents in English and in Arabic to Solr, and I use the following field type with Arabic analysis filters:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="50">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

 

There are two environments:
 # Local machine:
     - Solr version: 4.2
     - Windows version: 10
 # DEV env:
     - Solr version:
     - Cloudera suite
     - Linux kernel version: 3.10.0-862

 

The issue appears when uploading documents:
 # Local machine:
     - Document in English with English words only: OK (for example, "[www.apache.org|http://www.apache.org/]")
     - Document in Arabic with some English words: OK (for example, "[www.apache.org|http://www.apache.org/]")
 # DEV env:
     - Document in English with English words only: OK (for example, "[www.apache.org|http://www.apache.org/]")
     - Document in Arabic with some English words: the English text is reversed (for example, "gro.ehcapa.www"), which makes keyword search impossible.

 

Please advise whether this is fixable and, if so, how.


> Inversion English and numbers characters in Arabic documents
> ------------------------------------------------------------
>
>                 Key: SOLR-14832
>                 URL: https://issues.apache.org/jira/browse/SOLR-14832
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 4.1
>            Reporter: Vlad
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
