
Posted to solr-user@lucene.apache.org by ad...@ukr.net on 2020/09/07 09:43:29 UTC

Inverse English and digits in Arabic Text

Hi,

Could you please help me resolve an issue? I upload and index several documents in English and in Arabic into Solr, and I use the following field type for the Arabic language:
  <fieldType name="text" class="solr.TextField" positionIncrementGap="50">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

There are two environments:

Local machine:
  - Solr version: 4.2
  - Windows version: 10

DEV env:
  - Solr version: 4.1, as part of the Cloudera suite
  - Linux kernel version: 3.10.0-862

The issue appears when uploading documents:

Local machine:
  - Doc in English with English words only - ok (for example, "www.apache.org")
  - Doc in Arabic with some English words - ok (for example, "www.apache.org")

DEV env:
  - Doc in English with English words only - ok (for example, "www.apache.org")
  - Doc in Arabic with some English - the English text is inverted (for example, "gro.echapa.www"), which makes keyword search impossible.

Please advise whether this is fixable, and how.

Thank you in advance!

Re: Inverse English and digits in Arabic Text

Posted by Erick Erickson <er...@gmail.com>.

A quick test would be to send some simple queries by curl
rather than the browser, that’ll avoid any rendering issues.
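
Along the same lines, here is a rough Python sketch of one way to take rendering out of the picture once you have the raw response text. It prints a string as Unicode code points in storage (logical) order, so no browser or terminal bidi handling can reorder what you see. The function name `dump_code_points` and the sample value are made up for illustration; in practice the string would be a field value copied from the raw Solr response.

```python
import unicodedata


def dump_code_points(text):
    """Return one 'U+XXXX NAME' entry per character, in storage order."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Hypothetical field value: an Arabic word followed by a Latin token.
    sample = "\u0645\u0648\u0642\u0639 www"
    for entry in dump_code_points(sample):
        print(entry)
```

If the Latin letters come out in their normal order here, the stored text is probably fine and the inversion is only happening at display time.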

Second, take a look at the admin UI >> pick_a_collection_from_the_dropdown >> analysis
page and look at the terms in the field in question. Do they look “ok”? You’re
looking at what’s actually indexed at that point. The Terms Component lets
you look at the indexed terms more powerfully too:
https://lucene.apache.org/solr/guide/7_3/the-terms-component.html
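
As a sketch of what such a request could look like, the Python below only builds the URLs (it does not contact Solr). It uses the documented `terms`, `terms.fl`, `terms.prefix` and `terms.limit` parameters. The host, core name `collection1`, field name `text` and the `/terms` handler path are assumptions; the handler has to be configured in your solrconfig.xml for this to work.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, prefix, limit=20):
    """Build a Terms Component request URL for terms starting with `prefix`."""
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": prefix,
        "terms.limit": str(limit),
        "wt": "json",
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical core URL and field name; adjust to the real setup.
    base = "http://localhost:8983/solr/collection1"
    # Check both the normal prefix and the reversed one from the report.
    for prefix in ("www", "gro"):
        print(terms_url(base, "text", prefix))
```

If terms with the reversed prefix show up in the index, the text was most likely reversed before or during indexing. If only the normal form shows up, the reversal is more likely happening when the results are displayed.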

Finally, since it’s OK in one environment and not in another, it’s very likely not
an issue with Solr itself, but something different about the environments, especially
the indexing process. I doubt it’s a difference between Solr 4.1 and 4.2…

Best,
Erick

> On Sep 7, 2020, at 7:10 AM, Alexandre Rafalovitch <ar...@gmail.com> wrote:
> 
>> Doc in Arabic with some English - English text is inverted (for example,
> "gro.echapa.www"), what makes search by key words impossible.
> 
> What very specifically do you mean by that? How do you see the inversion?
> 
> If that's within some sort of web ui, then you are probably seeing the HTML
> bidi (bidirectional LTR/RTL) presentation issues.
> 
> And if you are seeing it in the Cloudera UI, then the question may be for their
> forum.
> 
> One way to test is to have English text in brackets "(www.apache.org)"
> within Arabic flow. If you see again your issue but the brackets get weird
> "((gro.....", this is most likely a bidi presentation issue with algorithm
> or HTML attribute set to RTL.
> 
> Could be something else though, but that would be a starting point.
> 
> Regards,
>    Alex
> 
> 

Re: Inverse English and digits in Arabic Text

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

> Doc in Arabic with some English - English text is inverted (for example,
"gro.echapa.www"), what makes search by key words impossible.

What very specifically do you mean by that? How do you see the inversion?

If that's within some sort of web ui, then you are probably seeing the HTML
bidi (bidirectional LTR/RTL) presentation issues.

And if you are seeing it in the Cloudera UI, then the question may be for their
forum.

One way to test is to have English text in brackets "(www.apache.org)"
within Arabic flow. If you see again your issue but the brackets get weird
"((gro.....", this is most likely a bidi presentation issue with algorithm
or HTML attribute set to RTL.
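
To make the bracket test a bit more concrete, here is a small Python sketch that looks at the stored string itself rather than at how it is rendered. `stored_form` reports whether a token appears as typed or character-reversed in storage order, and `bidi_runs` groups characters by their Unicode bidi class, which is what a display algorithm works from. Both helper names and the sample value are made up for illustration.

```python
import unicodedata


def bidi_runs(text):
    """Group consecutive characters by Unicode bidi class, in storage order."""
    runs = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if runs and runs[-1][0] == cls:
            runs[-1][1] += ch
        else:
            runs.append([cls, ch])
    return [(cls, run) for cls, run in runs]


def stored_form(text, needle):
    """Report whether `needle` is stored as typed, reversed, or absent."""
    if needle in text:
        return "as-typed"
    if needle[::-1] in text:
        return "reversed"
    return "absent"


if __name__ == "__main__":
    # Hypothetical stored value: Arabic text around a bracketed URL.
    value = "\u0645\u0648\u0642\u0639 (www.apache.org) \u0645\u0648\u0642\u0639"
    print(stored_form(value, "(www.apache.org)"))
    for cls, run in bidi_runs(value):
        print(cls, repr(run))
```

If the stored value comes back "as-typed" but still looks inverted on screen, that points to presentation rather than to the index.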

Could be something else though, but that would be a starting point.

Regards,
    Alex

