Posted to solr-user@lucene.apache.org by mohanmca01 <mo...@gmail.com> on 2017/02/01 11:50:18 UTC

Re: Arabic words search in solr

Dear Steve,

Thanks for investigating our problem. Our project is a business directory
search platform holding details for more than 100K businesses. I am
providing some examples of Arabic words to reproduce the problem. Please
find attached a Word file in which I have explained everything, along with
screenshots: arabicSearch.docx
<http://lucene.472066.n3.nabble.com/file/n4318227/arabicSearch.docx>

Regarding upgrading to the latest version: our project runs on Java 1.7,
so upgrading Solr would also mean upgrading Java, the JBoss application
server, and so on. Now is not the right time for that.



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4318227.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Aman Deep Singh,

Thanks for the information.

We tried EdgeNGramFilterFactory, but it is not working: we are not getting
the expected results.

Can you please suggest alternative approaches?

Thanks,




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Arabic words search in solr

Posted by Aman Deep Singh <am...@gmail.com>.
Try the edge ngram filter
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
I think it will help you solve the problem
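
For the partially typed word case, one way to apply the edge n-gram filter
is at index time only. The following is a minimal sketch, not a tested
configuration: the type name text_ar_prefix and the gram sizes 2 and 15 are
assumptions, and the stemmer is left out on purpose because stemming and
n-grams can interact in ways that are hard to predict.

```xml
<!-- Hypothetical field type; the name and gram sizes are assumptions. -->
<fieldType name="text_ar_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <!-- Index each leading substring of 2 to 15 characters of every token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <!-- No n-gram filter here: the typed prefix is matched as it is -->
  </analyzer>
</fieldType>
```

A field that uses this type has to be reindexed in full before a partially
typed word can match.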

On Sun, Aug 13, 2017 at 7:08 PM mohanmca01 <mo...@gmail.com> wrote:

> Hi Aman Deep Singh,
>
> Thanks for your update... I will update the status after complete the
> testing.
>
> I need one more help from your end,can you check below scenario:
>
> we are getting the results while using AND operator in between the words.
>
> Below is the example:
>
> Scenario 1:
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 1,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(مسقط AND الاتصال)",
>       "_": "1501998206658",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 44,
>     "start": 0,
>     "docs": [
>       {
>         "id": "56367",
>         "bizNameAr": "بنك مسقط - مركز الاتصال",
>         "_version_": 1574621133647380500
>       },
>       {
>         "id": "27224",
>         "bizNameAr": "بلدية مسقط -  - بلدية مسقط - مركز الاتصالات",
>         "_version_": 1574621132817956900
>       },
>       {
>         "id": "148922",
>         "bizNameAr": "بنك مسقط - ميثاق - مركز الاتصال",
>         "_version_": 1574621136335929300
>       },
>       {
>         "id": "23695",
>         "bizNameAr": "قوة السلطان الخاصة - مركز الإتصالات  - مسقط",
>         "_version_": 1574621132683739100
>       },
>       {
>         "id": "34992",
>         "bizNameAr": "طوارئ الكهرباء - محافظة مسقط - مركز الاتصال",
>         "_version_": 1574621133116801000
>       },
>       {
>         "id": "96500",
>         "bizNameAr": "شركة مسقط لتوزيع الكهرباء( ام اي دي سي)  - مركز
> الاتصال",
>         "_version_": 1574621134575370200
>       },
>       {
>         "id": "23966",
>         "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية
> العامة
> للاتصالات ونظم المعلومات -  - المديرية العامة للاتصالات ونظم المعلومات -
> البدالة",
>         "_version_": 1574621132692127700
>       },
>       {
>         "id": "24005",
>         "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية
> العامة
> للاتصالات ونظم المعلومات -  - مدير عام الاتصالات ونظم المعلومات -",
>         "_version_": 1574621132694225000
>       },
>       {
>         "id": "24026",
>         "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية
> العامة
> للاتصالات ونظم المعلومات -  - مساعد مدير عام الاتصالات ونظم المعلومات -",
>         "_version_": 1574621132694225000
>       },
>       {
>         "id": "24096",
>         "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية
> العامة
> للاتصالات ونظم المعلومات -  - مدير دائرة الاتصالات والصيانة -",
>         "_version_": 1574621132697370600
>       }
>     ]
>   }
> }
>
>
> Scenario 2:.
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 1,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(مسقط AND الات)",
>       "_": "1501998438821",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 0,
>     "start": 0,
>     "docs": []
>   }
> }
>
> We are expecting same results in the scenario 2 as well where am not typing
> the second word fully as in scenario’s 2 input.
>
>
> Below are the inputs used in both scenarios:
>
> Scenario 1:
> First word: مسقط
> Second word: الاتصال
>
> Scenario 2:
> First word: مسقط
> Second word: الات
>
> However, in our current production environment both of the above scenarios
> are working fine, but we have an issue of “Hamza” character where we are
> not
> getting results unless typing “Hamza” if it’s there.
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 9,
>     "params": {
>       "fl": "businessNmBl",
>       "indent": "true",
>       "q": "businessNmBl:شرطة إزكي",
>       "_": "1501997897849",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 1,
>     "start": 0,
>     "docs": [
>       {
>         "businessNmBl": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية
> -
> - مركز شرطة إزكي"
>       }
>     ]
>   }
> }
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4350392.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

RE: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Aman Deep Singh,

Thanks for your update. I will report the status after completing the
testing.

I need your help with one more thing. Could you check the scenarios below?

We get results when we put the AND operator between the words.

Here is an example:

Scenario 1:

{ 
  "responseHeader": { 
    "status": 0, 
    "QTime": 1, 
    "params": { 
      "indent": "true", 
      "q": "bizNameAr:(مسقط AND الاتصال)", 
      "_": "1501998206658", 
      "wt": "json" 
    } 
  }, 
  "response": { 
    "numFound": 44, 
    "start": 0, 
    "docs": [ 
      { 
        "id": "56367", 
        "bizNameAr": "بنك مسقط - مركز الاتصال", 
        "_version_": 1574621133647380500 
      }, 
      { 
        "id": "27224", 
        "bizNameAr": "بلدية مسقط -  - بلدية مسقط - مركز الاتصالات", 
        "_version_": 1574621132817956900 
      }, 
      { 
        "id": "148922", 
        "bizNameAr": "بنك مسقط - ميثاق - مركز الاتصال", 
        "_version_": 1574621136335929300 
      }, 
      { 
        "id": "23695", 
        "bizNameAr": "قوة السلطان الخاصة - مركز الإتصالات  - مسقط", 
        "_version_": 1574621132683739100 
      }, 
      { 
        "id": "34992", 
        "bizNameAr": "طوارئ الكهرباء - محافظة مسقط - مركز الاتصال", 
        "_version_": 1574621133116801000 
      }, 
      {
        "id": "96500",
        "bizNameAr": "شركة مسقط لتوزيع الكهرباء( ام اي دي سي)  - مركز الاتصال",
        "_version_": 1574621134575370200
      },
      {
        "id": "23966",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - المديرية العامة للاتصالات ونظم المعلومات - البدالة",
        "_version_": 1574621132692127700
      },
      {
        "id": "24005",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مدير عام الاتصالات ونظم المعلومات -",
        "_version_": 1574621132694225000
      },
      {
        "id": "24026",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مساعد مدير عام الاتصالات ونظم المعلومات -",
        "_version_": 1574621132694225000
      },
      {
        "id": "24096",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مدير دائرة الاتصالات والصيانة -",
        "_version_": 1574621132697370600 
      } 
    ] 
  } 
} 


Scenario 2:

{ 
  "responseHeader": { 
    "status": 0, 
    "QTime": 1, 
    "params": { 
      "indent": "true", 
      "q": "bizNameAr:(مسقط AND الات)", 
      "_": "1501998438821", 
      "wt": "json" 
    } 
  }, 
  "response": { 
    "numFound": 0, 
    "start": 0, 
    "docs": [] 
  } 
} 

We expect the same results in scenario 2 as well, where the second word is
only partially typed.


Below are the inputs used in both scenarios: 

Scenario 1:
First word: مسقط 
Second word: الاتصال 

Scenario 2:
First word: مسقط 
Second word: الات 

However, in our current production environment both of the above scenarios
work fine. There we have a different issue with the “Hamza” character: if a
word contains a Hamza, we get no results unless the Hamza is typed in the
query.

{ 
  "responseHeader": { 
    "status": 0, 
    "QTime": 9, 
    "params": { 
      "fl": "businessNmBl", 
      "indent": "true", 
      "q": "businessNmBl:شرطة إزكي", 
      "_": "1501997897849", 
      "wt": "json" 
    } 
  }, 
  "response": { 
    "numFound": 1, 
    "start": 0, 
    "docs": [ 
      {
        "businessNmBl": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - مركز شرطة إزكي"
      } 
    ] 
  } 
} 

Thanks,



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4350392.html

RE: Arabic words search in solr

Posted by Aman Deep Singh <am...@gmail.com>.
You can configure mm either in the request handler in solrconfig.xml or pass
it as a request parameter alongside the user query.
For more detail, refer to
 https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser

An example handler:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="df">searchFields</str>
    <str name="mm">100%</str>
    <str name="defType">dismax</str>
  </lst>
</requestHandler>

On 13-Aug-2017 6:43 PM, "mohanmca01" <mo...@gmail.com> wrote:

Hi Aman Deep,

Thanks for the information, In order to add mm=100% in the request handler,
in which place ?..Can you please share me sample snap. thanks in advance.






--
View this message in context:
http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4350389.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Aman Deep,

Thanks for the information. Where in the request handler should mm=100% be
added? Could you please share a sample snippet? Thanks in advance.






--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4350389.html

RE: Arabic words search in solr

Posted by Aman Deep Singh <am...@gmail.com>.
Use mm=100% in the request handler.
It gives the same behaviour as putting AND between the words.


On 06-Aug-2017 11:59 AM, "mohanmca01" <mo...@gmail.com> wrote:

hello Allison.

thank you for the information.

i referred to your slide "33", yes we are looking for same kind of results
and solution.

would you please guide us on how to achieve this?

also, we would like to know Instead of putting AND operator in between the
words if there is another way of doing this by adding this in configuration
level.

thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4349259.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hello Allison,

Thank you for the information.

I looked at slide 33 of your deck. Yes, we are looking for the same kind of
results and solution.

Would you please guide us on how to achieve this?

Also, instead of putting the AND operator between the words, we would like
to know whether the same can be achieved at the configuration level.

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4349259.html

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Can anyone help me with the use case below?



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4350390.html

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Dave,

Yes, we get results when we put the AND operator between the words.

Below is the example:

*Scenario 1:*

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "bizNameAr:(مسقط AND الاتصال)",
      "_": "1501998206658",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 44,
    "start": 0,
    "docs": [
      {
        "id": "56367",
        "bizNameAr": "بنك مسقط - مركز الاتصال",
        "_version_": 1574621133647380500
      },
      {
        "id": "27224",
        "bizNameAr": "بلدية مسقط -  - بلدية مسقط - مركز الاتصالات",
        "_version_": 1574621132817956900
      },
      {
        "id": "148922",
        "bizNameAr": "بنك مسقط - ميثاق - مركز الاتصال",
        "_version_": 1574621136335929300
      },
      {
        "id": "23695",
        "bizNameAr": "قوة السلطان الخاصة - مركز الإتصالات  - مسقط",
        "_version_": 1574621132683739100
      },
      {
        "id": "34992",
        "bizNameAr": "طوارئ الكهرباء - محافظة مسقط - مركز الاتصال",
        "_version_": 1574621133116801000
      },
      {
        "id": "96500",
        "bizNameAr": "شركة مسقط لتوزيع الكهرباء( ام اي دي سي)  - مركز الاتصال",
        "_version_": 1574621134575370200
      },
      {
        "id": "23966",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - المديرية العامة للاتصالات ونظم المعلومات - البدالة",
        "_version_": 1574621132692127700
      },
      {
        "id": "24005",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مدير عام الاتصالات ونظم المعلومات -",
        "_version_": 1574621132694225000
      },
      {
        "id": "24026",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مساعد مدير عام الاتصالات ونظم المعلومات -",
        "_version_": 1574621132694225000
      },
      {
        "id": "24096",
        "bizNameAr": "ديوان البلاط السلطاني - القصر - مسقط - المديرية العامة للاتصالات ونظم المعلومات -  - مدير دائرة الاتصالات والصيانة -",
        "_version_": 1574621132697370600
      }
    ]
  }
}


*Scenario 2:*

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "bizNameAr:(مسقط AND الات)",
      "_": "1501998438821",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

We expect the same results in scenario 2 as well, where the second word is
only partially typed.


Below are the inputs used in both scenarios:

*Scenario 1:*
First word: مسقط
Second word: الاتصال

*Scenario 2:*
First word: مسقط
Second word: الات

However, in our current production environment both of the above scenarios
work fine. There we have a different issue with the “Hamza” character: if a
word contains a Hamza, we get no results unless the Hamza is typed in the
query.

{
  "responseHeader": {
    "status": 0,
    "QTime": 9,
    "params": {
      "fl": "businessNmBl",
      "indent": "true",
      "q": "businessNmBl:شرطة إزكي",
      "_": "1501997897849",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "businessNmBl": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - مركز شرطة إزكي"
      }
    ]
  }
}
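
For the Hamza case, one possible direction is to normalize the text on both
the index and the query side, so that إزكي and ازكي reduce to the same token.
The following is a minimal sketch under assumptions: it assumes the
production field businessNmBl currently has no Arabic normalization (the
thread does not show its configuration), and the type name text_ar_norm is
hypothetical.

```xml
<!-- Hypothetical field type for businessNmBl; the name is an assumption. -->
<fieldType name="text_ar_norm" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer is applied at both index time and query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds alef variants such as إ and أ to a bare ا, among other normalizations -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents would need to be reindexed after such a change, since
the stored tokens were produced by the old analysis chain.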






--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4349258.html

RE: Arabic words search in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
+1

I was hoping to use this as a case for arguing for turning off an overly aggressive stemmer, but I checked on your 10 docs and query, and David is right, of course -- if you change the default operator to AND, you only get the one document back that you had intended to.

I can still use this as a case for getting on my Unicode normalization soapbox and +1'ing your use of the ICUFoldingFilter.  With no token filters, you get 4 results; when you add the ICUFoldingFilter, you get 8 results; and when you add in the Arabic stemmer, you get all 10.  Not that you need this, but see slide 33 of [1], where we show 78 Unicode variants for "America" in ~800k docs in an Arabic script language.  Without Unicode normalization, users might get 1/2 the documents back or far, far fewer...and they wouldn't even know what they were missing!

[1] https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf

-----Original Message-----
From: David Hastings [mailto:hastings.recursive@gmail.com] 
Sent: Wednesday, August 2, 2017 9:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Arabic words search in solr

perhaps change your default operator to AND instead of OR if thats what you are expecting for a result

On Wed, Aug 2, 2017 at 8:57 AM, mohanmca01 <mo...@gmail.com> wrote:

> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> we tried your suggested solution by removing hyphen while indexing, 
> but it was getting wrong results. i was searching for "شرطة ازكي" and 
> it was showing me the result that am looking for, plus irrelevant 
> result which either have the first or second word that i have typed while searching.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 3,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(شرطة ازكي)",
>       "_": "1501678260335",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 444,
>     "start": 0,
>     "docs": [
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  
> -
> -
> مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>       {
>         "id": "13937",
>         "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
>         "_version_": 1574621132197200000
>       },
>       {
>         "id": "15914",
>         "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
>         "_version_": 1574621132344000500
>       },
>       {
>         "id": "20639",
>         "bizNameAr": "سحائب ازكي للتجارة",
>         "_version_": 1574621132574687200
>       },
>       {
>         "id": "25108",
>         "bizNameAr": "المستشفيات -  - مستشفى إزكي",
>         "_version_": 1574621132737216500
>       },
>       {
>         "id": "27629",
>         "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
>         "_version_": 1574621132833685500
>       },
>       {
>         "id": "36351",
>         "bizNameAr": "طوارئ الكهرباء - إزكي",
>         "_version_": 1574621133183910000
>       },
>       {
>         "id": "61235",
>         "bizNameAr": "اضواء ازكي للتجارة",
>         "_version_": 1574621133785792500
>       },
>       {
>         "id": "66821",
>         "bizNameAr": "أطلال إزكي للتجارة",
>         "_version_": 1574621133915816000
>       },
>       {
>         "id": "67011",
>         "bizNameAr": "بنك ظفار - فرع ازكي",
>         "_version_": 1574621133920010200
>       }
>     ]
>   }
> }
>
> Actually  we expecting the below results only since it has both the 
> words that we typed while searching:
>
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  
> -
> -
> مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>
>
> Configuration:
>
> In schema.xml we configured as below:
>
>     <field name="bizNameAr" type="text_ar" indexed="true" 
> stored="true"/>
>
>
>     <fieldType name="text_ar" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_ar.txt" />
>         <filter class="solr.ArabicNormalizationFilterFactory"/>
>         <filter class="solr.ArabicStemFilterFactory"/>
>                 <filter class="solr.ICUFoldingFilterFactory"/>
>                 <filter class="solr.HyphenatedWordsFilterFactory"/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="ى"
> replacement="ئ"/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="ء"
> replacement=""/>
>       </analyzer>
>     </fieldType>
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Arabic words search in solr

Posted by David Hastings <ha...@gmail.com>.
Perhaps change your default operator to AND instead of OR, if that's what
you are expecting for a result.
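
If AND is the behaviour wanted for every query, one option is to set it once
in the request handler defaults instead of typing it between the words. A
minimal sketch, assuming the standard (lucene) query parser is in use and
that bizNameAr is the default search field; both are assumptions about this
setup:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Assumed default field, taken from the queries in this thread -->
    <str name="df">bizNameAr</str>
    <!-- Require every query term to match unless the query says otherwise -->
    <str name="q.op">AND</str>
  </lst>
</requestHandler>
```

The same setting can also be sent per request as the parameter q.op=AND,
which is an easy way to try it out before changing solrconfig.xml.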

On Wed, Aug 2, 2017 at 8:57 AM, mohanmca01 <mo...@gmail.com> wrote:

> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> we tried your suggested solution by removing hyphen while indexing, but it
> was getting wrong results. i was searching for "شرطة ازكي" and it was
> showing me the result that am looking for, plus irrelevant result which
> either have the first or second word that i have typed while searching.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 3,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(شرطة ازكي)",
>       "_": "1501678260335",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 444,
>     "start": 0,
>     "docs": [
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> -
> مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>       {
>         "id": "13937",
>         "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
>         "_version_": 1574621132197200000
>       },
>       {
>         "id": "15914",
>         "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
>         "_version_": 1574621132344000500
>       },
>       {
>         "id": "20639",
>         "bizNameAr": "سحائب ازكي للتجارة",
>         "_version_": 1574621132574687200
>       },
>       {
>         "id": "25108",
>         "bizNameAr": "المستشفيات -  - مستشفى إزكي",
>         "_version_": 1574621132737216500
>       },
>       {
>         "id": "27629",
>         "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
>         "_version_": 1574621132833685500
>       },
>       {
>         "id": "36351",
>         "bizNameAr": "طوارئ الكهرباء - إزكي",
>         "_version_": 1574621133183910000
>       },
>       {
>         "id": "61235",
>         "bizNameAr": "اضواء ازكي للتجارة",
>         "_version_": 1574621133785792500
>       },
>       {
>         "id": "66821",
>         "bizNameAr": "أطلال إزكي للتجارة",
>         "_version_": 1574621133915816000
>       },
>       {
>         "id": "67011",
>         "bizNameAr": "بنك ظفار - فرع ازكي",
>         "_version_": 1574621133920010200
>       }
>     ]
>   }
> }
>
> Actually  we expecting the below results only since it has both the words
> that we typed while searching:
>
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> -
> مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>
>
> Configuration:
>
> In schema.xml we configured as below:
>
>     <field name="bizNameAr" type="text_ar" indexed="true" stored="true"/>
>
>
>     <fieldType name="text_ar" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_ar.txt" />
>         <filter class="solr.ArabicNormalizationFilterFactory"/>
>         <filter class="solr.ArabicStemFilterFactory"/>
>                 <filter class="solr.ICUFoldingFilterFactory"/>
>                 <filter class="solr.HyphenatedWordsFilterFactory"/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="ى"
> replacement="ئ"/>
>                 <charFilter class="solr.PatternReplaceCharFilterFactory"
> pattern="ء"
> replacement=""/>
>       </analyzer>
>     </fieldType>
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Arabic words search in solr

Posted by Tim Casey <tc...@gmail.com>.
There should be a way to use a phrasal query for the specific names.

On Wed, Aug 2, 2017 at 2:15 PM, Phil Scadden <P....@gns.cri.nz> wrote:

> Hopefully changing to default AND solves your problem. If so, I would be
> quite interested in what your index config looks like in the end. I also
> have upcoming need to index Arabic words.
>
> -----Original Message-----
> From: mohanmca01 [mailto:mohanmca01@gmail.com]
> Sent: Thursday, 3 August 2017 12:58 a.m.
> To: solr-user@lucene.apache.org
> Subject: RE: Arabic words search in solr
>
> Hi Phil Scadden,
>
>  Thank you for your reply,
>
> we tried your suggested solution by removing hyphen while indexing, but it
> was getting wrong results. i was searching for "شرطة ازكي" and it was
> showing me the result that am looking for, plus irrelevant result which
> either have the first or second word that i have typed while searching.
>
> First word: شرطة
> Second Word: ازكي
>
> results that we are getting:
>
>
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 3,
>     "params": {
>       "indent": "true",
>       "q": "bizNameAr:(شرطة ازكي)",
>       "_": "1501678260335",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 444,
>     "start": 0,
>     "docs": [
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> - مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>       {
>         "id": "13937",
>         "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
>         "_version_": 1574621132197200000
>       },
>       {
>         "id": "15914",
>         "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
>         "_version_": 1574621132344000500
>       },
>       {
>         "id": "20639",
>         "bizNameAr": "سحائب ازكي للتجارة",
>         "_version_": 1574621132574687200
>       },
>       {
>         "id": "25108",
>         "bizNameAr": "المستشفيات -  - مستشفى إزكي",
>         "_version_": 1574621132737216500
>       },
>       {
>         "id": "27629",
>         "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
>         "_version_": 1574621132833685500
>       },
>       {
>         "id": "36351",
>         "bizNameAr": "طوارئ الكهرباء - إزكي",
>         "_version_": 1574621133183910000
>       },
>       {
>         "id": "61235",
>         "bizNameAr": "اضواء ازكي للتجارة",
>         "_version_": 1574621133785792500
>       },
>       {
>         "id": "66821",
>         "bizNameAr": "أطلال إزكي للتجارة",
>         "_version_": 1574621133915816000
>       },
>       {
>         "id": "67011",
>         "bizNameAr": "بنك ظفار - فرع ازكي",
>         "_version_": 1574621133920010200
>       }
>     ]
>   }
> }
>
> Actually  we expecting the below results only since it has both the words
> that we typed while searching:
>
>       {
>         "id": "28107",
>         "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -
> - مركز شرطة إزكي",
>         "_version_": 1574621132849414100
>       },
>
>
> Configuration:
>
> In schema.xml we configured as below:
>
>     <field name="bizNameAr" type="text_ar" indexed="true" stored="true"/>
>
>
>     <fieldType name="text_ar" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_ar.txt" />
>         <filter class="solr.ArabicNormalizationFilterFactory"/>
>         <filter class="solr.ArabicStemFilterFactory"/>
> <filter class="solr.ICUFoldingFilterFactory"/>
> <filter class="solr.HyphenatedWordsFilterFactory"/>
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ى"
> replacement="ئ"/>
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ء"
> replacement=""/>
>       </analyzer>
>     </fieldType>
>
>
> Thanks,
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>

RE: Arabic words search in solr

Posted by Phil Scadden <P....@gns.cri.nz>.
Hopefully changing the default operator to AND solves your problem. If so, I would be quite interested in what your index config looks like in the end. I also have an upcoming need to index Arabic words.

Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences Limited (GNS Science). If received in error please destroy and immediately notify GNS Science. Do not copy or disclose the contents.

RE: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Phil Scadden,

Thank you for your reply.

We tried your suggested solution of removing the hyphen while indexing, but we are still getting wrong results. I searched for "شرطة ازكي" and got the result I am looking for, plus irrelevant results that contain only the first or only the second word of the query.

First word: شرطة 
Second Word: ازكي

results that we are getting:


{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "indent": "true",
      "q": "bizNameAr:(شرطة ازكي)",
      "_": "1501678260335",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 444,
    "start": 0,
    "docs": [
      {
        "id": "28107",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - مركز شرطة إزكي",
        "_version_": 1574621132849414100
      },
      {
        "id": "13937",
        "bizNameAr": "مؤسسةا الازكي للتجارة والمقاولات",
        "_version_": 1574621132197200000
      },
      {
        "id": "15914",
        "bizNameAr": "العلوي والازكي المتحدة ش.م.م",
        "_version_": 1574621132344000500
      },
      {
        "id": "20639",
        "bizNameAr": "سحائب ازكي للتجارة",
        "_version_": 1574621132574687200
      },
      {
        "id": "25108",
        "bizNameAr": "المستشفيات -  - مستشفى إزكي",
        "_version_": 1574621132737216500
      },
      {
        "id": "27629",
        "bizNameAr": "وزارة الداخلية -  -  - والي إزكي -",
        "_version_": 1574621132833685500
      },
      {
        "id": "36351",
        "bizNameAr": "طوارئ الكهرباء - إزكي",
        "_version_": 1574621133183910000
      },
      {
        "id": "61235",
        "bizNameAr": "اضواء ازكي للتجارة",
        "_version_": 1574621133785792500
      },
      {
        "id": "66821",
        "bizNameAr": "أطلال إزكي للتجارة",
        "_version_": 1574621133915816000
      },
      {
        "id": "67011",
        "bizNameAr": "بنك ظفار - فرع ازكي",
        "_version_": 1574621133920010200
      }
    ]
  }
}

We actually expect only the result below, since it is the only one that contains both of the words we typed:

      {
        "id": "28107",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية  -  - مركز شرطة إزكي",
        "_version_": 1574621132849414100
      },


Configuration:

In schema.xml we configured as below:

    <field name="bizNameAr" type="text_ar" indexed="true" stored="true"/>

    
    <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.HyphenatedWordsFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ى" replacement="ئ"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ء" replacement=""/>
      </analyzer>
    </fieldType>


Thanks,





--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348774.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Arabic words search in solr

Posted by Phil Scadden <P....@gns.cri.nz>.
Further to that, what results do you get when you put those indexed terms into the Analysis tool in the Solr Admin UI?


RE: Arabic words search in solr

Posted by Phil Scadden <P....@gns.cri.nz>.
Am I correct in assuming that you have the problem searching only when there is a hyphen in your indexed text? If so, it would suggest that you need to use a different tokenizer when indexing. It looks like the hyphen is removed and the words on each side are concatenated, hence you need both terms to find the text.


Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Please help me on this...



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4348372.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Thank you for your reply; it has taken me quite a long time to get back to you.

I have tried what you suggested, and there were some improvements in
searching and in the results returned.

However, the team is having difficulty searching with shortened (partial)
forms of the indexed names, which forced us to revert the suggested changes.

Below are examples of the problem:


---------------------------------
*Example 1:*

*Indexed Text*
بنك مسقط - مركز الاتصال

*Searched*
مسقط الات

*Remarks of Example 1*
Unable to get the indexed result unless I type the two words in full (مسقط الاتصال).


{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "indent": "true",
      "q": "businessNmBl:(مسقط الات)",
      "_": "1499758511717",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}


---------------------------------

*Example 2:*

*Indexed Text*
الطيران العماني - مركز الاتصال

*Searched*
الطير الات

*Remarks*
Unable to get the indexed result unless I type the two words in full (الطيران الاتصال).


{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "indent": "true",
      "q": "businessNmBl:(طير الات)",
      "_": "1499758649600",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}



Please note that the existing production configuration (the one with the
problems with hamza (ء) etc.) works with the above examples. They stop
working only once we apply your suggested configuration.

Thanks in advance





--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4345392.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

Your examples refer to documents I don’t have in my 9 document set, so I recast the problem to a query/doc combo I have from earlier in this thread, and I was able to restrict hits to only documents that contained all terms from the query.

If I use the query “name_ar:(شرطة ازكي)” I get 3 hits (I’ve left out some details):

-----
{ "responseHeader": { ... "params": { "q":"name_ar:(شرطة ازكي)", ... } },
  "response": { "numFound":3, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... },
      { "id":"3", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"], ... },
      { "id":"8", "name_ar":["وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"], ... }]}
-----

If I add “q.op=AND” to the request, only one of these documents matches - note that I’ve also checked the “debugQuery” option on the Admin UI:

-----
{ "responseHeader": { ...
  "params": { "q":"name_ar:(شرطة ازكي)", "q.op":"AND", "debugQuery":"true", ... } },
  "response": { "numFound":1, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... }]},
  "debug": {
    "rawquerystring": "name_ar:(شرطة ازكي)",
    "querystring": "name_ar:(شرطة ازكي)",
    "parsedquery": "+name_ar:شرط +name_ar:ازك",
    "parsedquery_toString": "+name_ar:شرط +name_ar:ازك",
-----

Note the “parsedquery" above - it shows how to require individual terms when specifying the field for each term.  This is how the "name_ar:(شرطة ازكي)” query is interpreted when the "q.op=AND” request param is used.

The equivalent query using ‘+’ signs is: "name_ar:(+شرطة +ازكي)”.  This *looks* strange because of how the Unicode bidirectional algorithm works.  This W3C writeup uses Arabic to drive its discussion of display of strings that contain both RTL and LTR character runs, and I found it quite helpful here: <https://www.w3.org/International/articles/inline-bidi-markup/uba-basics>.
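As a minimal sketch of the ‘+’ form described above, the query string can be assembled programmatically, which also sidesteps the confusing bidirectional display when typing it by hand. This is plain Python, not part of Solr; the helper name `require_all` is hypothetical, and it assumes the terms contain no characters that need escaping in the query syntax.

```python
def require_all(field, terms):
    """Build a query like field:(+t1 +t2) so that every term is required.

    Assumption: terms contain no query-syntax special characters,
    so no escaping is attempted here.
    """
    required = " ".join("+" + term for term in terms)
    return f"{field}:({required})"


# The two query words from this thread, written as escapes so the
# logical (typed) order is unambiguous regardless of how an editor
# renders right-to-left text.
SHURTA = "\u0634\u0631\u0637\u0629"  # شرطة
IZKI = "\u0627\u0632\u0643\u064A"    # ازكي

query = require_all("name_ar", [SHURTA, IZKI])
print(query)
```

The printed string may look reordered on screen, but each term is preceded by its own ‘+’ in the underlying character sequence.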

Here’s the output from the "name_ar:(+شرطة +ازكي)” query:

-----
{ "responseHeader": { ... "params": { "q":"name_ar:(+شرطة +ازكي)", "debugQuery":"true" ... } },
  "response": { "numFound":1, "start":0,
    "docs": [
      { "id":"6", "name_ar":["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"], ... }]},
  "debug": {
    "rawquerystring": "name_ar:(+شرطة +ازكي)",
    "querystring": "name_ar:(+شرطة +ازكي)",
    "parsedquery": "+name_ar:شرط +name_ar:ازك",
    "parsedquery_toString": "+name_ar:شرط +name_ar:ازك",
-----

The above is the same result (and has the same parsedQuery) as query "name_ar:(شرطة ازكي)” with request param “q.op=AND”.

I won’t show it here, but I get the same 1-hit result for this query when I use AND instead of ‘+’: "name_ar:(شرطة AND ازكي)” - note that the terms only *appear* to be in reverse order because of how the Unicode bidirectional algorithm works.

> On Mar 9, 2017, at 2:30 AM, mohanmca01 <mo...@gmail.com> wrote:
> 
> I saw your products in lucidworks website. Do you have any solr arabic
> support customized product?

Lucidworks doesn’t have a specifically Arabic-focused product, but we have helped people enable Arabic search in the past.  Click on the “Contact Us” link on the website if you’d like to talk to us about getting involved.

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Thanks for the support. I tried the cases below, but I'm still not able to get
the expected results.

Case 1 :

Input :  bizNameAr:شرطة + ازكي

Output : {

  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": " bizNameAr:شرطة + ازكي",
      "_": "1489041466096",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 4,
    "start": 0,
    "docs": [
      {
        "id": "82",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
        "_version_": 1560298301338681300
      },
      {
        "id": "63",
        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي",
        "_version_": 1560298301325049900
      },
      {
        "id": "56",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء",
        "_version_": 1560298301319807000
      },
      {
        "id": "79",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء",
        "_version_": 1560298301335535600
      }
    ]
  }
}


In this case, documents 63, 56 and 79 do not match the input;
id 82 is the only correct result.



Case 2:


{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "indent": "true",
      "q": " bizNameAr:شرطة AND ازكي",
      "_": "1489043935549",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}


If AND is placed between the terms, no results are returned.

I saw your products in lucidworks website. Do you have any solr arabic
support customized product?

Thanks,






-- 
Regards,
Mohan.N
9865998919




--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4324142.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

> On Feb 26, 2017, at 1:37 AM, mohanmca01 <mo...@gmail.com> wrote:
> 
> i searched with (bizNameAr: شرطة ازكي), and am getting:
> […]
> 
> the expected result is:   "id": "82",
>                                  "bizNameAr": "شرطة عمان السلطانية - قيادة
> شرطة محافظة الداخلية - - مركز *شرطة إزكي*",
> 
> as the above has both the words mentioned in the query (marked as Bold),
> where the rest have the following:
> 
>        "id": "63",
>        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي"
> 
> it has only one word of the query (ازكي)
> 
>        "id": "56",
>        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية 
> -  - مركز شرطة إبراء"
> 
> it has only one word of the query (شرطة)
> 
> "id": "79",
> "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز
> شرطة إبراء"
> 
> It has only one word of the query (شرطة)
> 
> where the above 3 records should not come in the result since already 2
> words mentioned in the query, and only one record has these two words.

Solr's standard query language includes two mechanisms for requiring terms: ‘+’ before a required term, and ‘AND’ between two required terms.  ‘+’ is better - see <https://lucidworks.com/2011/12/28/why-not-and-or-and-not/> for more information.

You can also set the default operator to ‘AND’, e.g. via request parameter “&q.op=AND” (if this is always what you want, you can include this in the /select request handler’s definition in solrconfig.xml).  See <https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser> for more information.  
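For illustration, a request that sets the default operator to AND could be assembled as below. This is only a sketch in Python using the standard library; the host, port and core name (`localhost:8983`, `mycore`) and the helper name `build_select_url` are placeholders, not values from this thread.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute your own host and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_select_url(query, default_operator="AND"):
    """Return a /select URL whose q.op parameter makes all terms required."""
    params = {
        "q": query,
        "q.op": default_operator,  # default operator between query terms
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 by default,
    # so Arabic query terms are safe to pass in as-is.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_select_url("name_ar:(\u0634\u0631\u0637\u0629 \u0627\u0632\u0643\u064A)")
print(url)
```

Sending the parameter with each request is handy for testing; putting it in the request handler defaults, as mentioned above, avoids having to repeat it.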

> I would really suggest if we can give you a real-time demo on our system
> with my Arab colleague so it can be more clear for you. let us know if we
> can do that.

I prefer to keep discussion on this public mailing list so that others can benefit.  If you find that you need faster or more interactive help, you can check out the list of people who have indicated that they provide Solr support: <https://wiki.apache.org/solr/Support>.

--
Steve
www.lucidworks.com


Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Any update on this?



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4323005.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Thank you for your continued investigation.

This has improved the search a little, but I am facing another issue: I am
getting records that do not contain one of the words in my query.

Please note that you indexed only 9 records, whereas I shared more than 76
sample records with you (please refer to the Examples sheet of the earlier
attachment Arabic_Characters2.xlsx) to index so that you can reproduce the
issue.

For example, I searched with (bizNameAr: شرطة ازكي), and am getting:

{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "indent": "true",
      "q": "bizNameAr: شرطة ازكي",
      "_": "1488089550104",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 4,
    "start": 0,
    "docs": [
      {
        "id": "82",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي",
        "_version_": 1560298301338681300
      },
      {
        "id": "63",
        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي",
        "_version_": 1560298301325049900
      },
      {
        "id": "56",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء",
        "_version_": 1560298301319807000
      },
      {
        "id": "79",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء",
        "_version_": 1560298301335535600
      }
    ]
  }
}



The expected result is:

        "id": "82",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز *شرطة إزكي*",

as it contains both of the words in the query (marked in bold), whereas the
rest are as follows:

        "id": "63",
        "bizNameAr": "شركة ظفار للتأمين ش.م.ع.ع - فرع ازكي"

It has only one word of the query (ازكي).

        "id": "56",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية  -  - مركز شرطة إبراء"

It has only one word of the query (شرطة).

        "id": "79",
        "bizNameAr": "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"

It has only one word of the query (شرطة).

The above 3 records should not appear in the results, since the query
contains 2 words and only one record has both of them.


I would really suggest that we give you a live demo of our system with my
Arab colleague, so that the problem is clearer to you. Let us know if we
can do that.

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4322354.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

I indexed your 9 examples as simple documents after mapping dynamic field “*_ar” to the “text_ar” field type:

-----
[{"id":"1", "name_ar":"المؤسسة التجارية العمانية"},
{"id":"2", "name_ar":"شركة التأمين الأهلية ش.م.ع.م"},
{"id":"3", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"},
{"id":"4", "name_ar":"شركة ظفار للتأمين ش.م.ع.ع"},
{"id":"5", "name_ar":"طوارئ المستشفيات   - طوارئ مستشفى صحار"},
{"id":"6", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"},
{"id":"7", "name_ar":"المؤسسة التجارية العمانية"},
{"id":"8", "name_ar":"وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"},
{"id":"9", "name_ar":"أسعار المكالمات الدولية - مونتسرات -  - مونتسرات"}]
-----

Then when I search from the Admin UI for “name_ar:شرطة ازكي” (the query in one of your screenshots with numFound=0) I get the following results:

-----
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "indent": "true",
      "q": "name_ar:شرطة ازكي",
      "_": "1487912340325",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "6",
        "name_ar": [
          "شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"
        ],
        "_version_": 1560170434794619000
      },
      {
        "id": "3",
        "name_ar": [
          "شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"
        ],
        "_version_": 1560170434793570300
      }
    ]
  }
}
-----

So I cannot reproduce the failures you’re seeing.  In fact, I tried all 9 of the queries you listed as not working, and all of them matched at least one of the above 9 documents, except for case 5 (which I give details for below).  Are you absolutely sure that you re-indexed your data after making the ICU folding filter the last filter?

The one query that didn’t return any matches for me is “name_ar:طوارى صحار”.  Here’s why:

Indexed original: طوارئ صحار
Indexed analyzed: طواري صحار

Query original: طوارى صحار
Query analyzed: طوار صحار

In the analyzed indexed form, the “ئ” (yeh with hamza above) is left intact by ArabicNormalizationFilter and ArabicStemFilter, and then the ICUFoldingFilter converts it to “ي” (yeh without the hamza).

In the analyzed query, ArabicNormalizationFilter converts “طوارى” to “طواري” (alef maksura->yeh), which ArabicStemFilter converts to “طوار” by removing the trailing yeh.
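To make the three letters in the analysis above easier to tell apart, here is a small Python sketch using only the standard `unicodedata` module. It does not run Solr's analysis chain; stripping combining marks after NFD decomposition is only a rough stand-in for what a folding filter does, and the helper name `strip_combining_marks` is hypothetical.

```python
import unicodedata

ALEF_MAKSURA = "\u0649"    # ى
YEH = "\u064A"             # ي
YEH_WITH_HAMZA = "\u0626"  # ئ


def strip_combining_marks(text):
    """Decompose to NFD and drop combining marks (a rough analogy for folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Print the code point and official Unicode name of each letter.
for ch in (ALEF_MAKSURA, YEH, YEH_WITH_HAMZA):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Yeh with hamza decomposes to yeh plus a combining hamza, so dropping the
# mark leaves a plain yeh. Alef maksura is a separate letter with no
# decomposition, so it is left unchanged.
print(strip_combining_marks(YEH_WITH_HAMZA) == YEH)
print(strip_combining_marks(ALEF_MAKSURA) == ALEF_MAKSURA)
```

This is consistent with the behaviour described above: the indexed form ends up with a plain yeh, while a query typed with alef maksura takes a different path through normalization and stemming.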

I don’t know what the correct thing to do is to make alef maksura and yeh match each other, but one possibility is adding a char filter that converts all alefs maksura into yehs with hamza, like this:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="ى" replacement="ئ"/>

When I added the above to my “text_ar" field type and re-indexed, I got the following when I queried for “name_ar:طوارى صحار”:

-----
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "indent": "true",
      "q": "name_ar:طوارى صحار",
      "_": "1487915432177",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "id": "5",
        "name_ar": [
          "طوارئ المستشفيات   - طوارئ مستشفى صحار"
        ],
        "_version_": 1560192353894924300
      },
      {
        "id": "8",
        "name_ar": [
          "وزارة الصحة - المديرية العامة للخدمات الصحية  محافظة الداخلية -  - مستشفى إزكي (البدالة)  - الطوارئ"
        ],
        "_version_": 1560192353895972900
      }
    ]
  }
}
-----

--
Steve
www.lucidworks.com

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

As per your suggestion, I added the ICU folding filter and re-indexed the
entire Solr data, but I am still unable to get the expected results that I
highlighted earlier.

I have attached an Excel sheet with examples of Arabic names so that you can
investigate and reproduce the issue.
Arabic_Characters2.xlsx
<http://lucene.472066.n3.nabble.com/file/n4321582/Arabic_Characters2.xlsx>  

thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4321582.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

It looks to me like the example query should match, since the analyzed query terms look like a subset of the analyzed document terms.

Did you re-index your documents after you changed your schema?  If not, then the indexed documents won’t have the same terms as the ones you see on the Admin UI Analysis pane.

If you have re-indexed, and are still not getting matches you expect, please include textual examples of the remaining problems, so that I can copy/paste to reproduce the problem - I can’t copy/paste Arabic from images you pointed to.

--
Steve
www.lucidworks.com



Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

I changed the position of the ICU folding filter and re-indexed the entire
Arabic content, but the problem is still present. I am not able to get the
expected result.

I have attached screenshots for your reference.
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin.png> 
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin%281%29.png> 
<http://lucene.472066.n3.nabble.com/file/n4321397/Solr_Admin%282%29.png> 

Kindly check and let me know.

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4321397.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

When I said "the ICU folding filter should be the last filter, to allow the Arabic normalization and stemming filters to see the original words”, I meant that no filter should follow it.  

You did not make that change.

Here’s what I mean:

   <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
     <analyzer> 
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
       <filter class="solr.ArabicNormalizationFilterFactory"/>
       <filter class="solr.ArabicStemFilterFactory"/>
       <filter class="solr.ICUFoldingFilterFactory"/>
     </analyzer>
   </fieldType>
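
The ordering rule above can also be checked mechanically. A minimal sketch, assuming the fieldType definition is available as a string; the helper names `filter_classes` and `folding_is_last` are hypothetical, and the code only inspects the declaration, it does not run any analysis:

```python
import xml.etree.ElementTree as ET

# The fieldType from the message above, with the wrapped line rejoined.
FIELD_TYPE_XML = """
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
"""


def filter_classes(field_type_xml):
    """Return the class attribute of every <filter>, in declaration order."""
    root = ET.fromstring(field_type_xml)
    return [f.get("class") for f in root.iter("filter")]


def folding_is_last(field_type_xml):
    """True if the final declared filter is the ICU folding filter."""
    classes = filter_classes(field_type_xml)
    return bool(classes) and classes[-1] == "solr.ICUFoldingFilterFactory"


print(filter_classes(FIELD_TYPE_XML))
print(folding_is_last(FIELD_TYPE_XML))
```

Running the same check against a definition where another filter follows the folding filter would report `False`.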

--
Steve
www.lucidworks.com



Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

As per your suggestion, I added ICUFoldingFilterFactory to schema.xml as
shown below:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

I attached a document showing the expected results earlier in this thread
for your reference.

Kindly check and let me know.

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4320427.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

Did you change the order of the filters as I suggested?

--
Steve
www.lucidworks.com

On Tue, Feb 14, 2017 at 8:05 AM mohanmca01 <mo...@gmail.com> wrote:

> Hi Steve,
>
> Any update on this? I am waiting for your input.

Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Any update on this? I am waiting for your input.



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4320253.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

I haven’t looked at the latest problems, but the ICU folding filter should be the last filter, to allow the Arabic normalization and stemming filters to see the original words.

--
Steve
www.lucidworks.com

> On Feb 8, 2017, at 10:58 PM, mohanmca01 <mo...@gmail.com> wrote:
> 
> Hi Steve,
> 
> Thanks for your continued investigation of this issue.
> 
> I added the ICU Folding Filter to the schema.xml file and re-indexed all the
> data. I noticed some improvement in search, but it is still not working as
> expected.
> 
> Below is the configuration I changed in the schema file:
> 
> -----------------
> <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
>     <filter class="solr.ArabicNormalizationFilterFactory"/>
>     <filter class="solr.ArabicStemFilterFactory"/>
>   </analyzer>
> </fieldType>
> -----------------
> 
> I have attached a document for your reference; the cases highlighted in red
> are not working as expected.
> 
> Also, I raised a point regarding jQuery autocomplete with unique records.
> Kindly let me know if you have any background on how to implement it.
> 
> arabicSearch.docx
> <http://lucene.472066.n3.nabble.com/file/n4319436/arabicSearch.docx>  


Re: Arabic words search in solr

Posted by mohanmca01 <mo...@gmail.com>.
Hi Steve,

Thanks for your continued investigation of this issue.

I added the ICU Folding Filter to the schema.xml file and re-indexed all the
data. I noticed some improvement in search, but it is still not working as
expected.

Below is the configuration I changed in the schema file:

-----------------
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
-----------------

I have attached a document for your reference; the cases highlighted in red
are not working as expected.

Also, I raised a point regarding jQuery autocomplete with unique records.
Kindly let me know if you have any background on how to implement it.

arabicSearch.docx
<http://lucene.472066.n3.nabble.com/file/n4319436/arabicSearch.docx>  


 



--
View this message in context: http://lucene.472066.n3.nabble.com/Arabic-words-search-in-solr-tp4317733p4319436.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Arabic words search in solr

Posted by Steve Rowe <sa...@gmail.com>.
Hi Mohan,

I ran your Case #1 through Solr 4.9.0’s Admin UI Analysis pane, and I can see that the analyzer for the field type “text_ar” does not remove all diacritics:

Indexed original: المؤسسة التجارية العمانية
Indexed analyzed: مؤسس تجار عمان

Query original: الموسسة التجارية
Query analyzed: موسس تجار

The analyzed query terms are the same as the first two analyzed indexed terms, with one exception: the hamza on the waw in the analyzed indexed term “مؤسس” was not stripped off by the analyzer, and so won’t match the analyzed query term “موسس”, which was entered by the user without the hamza.

Adding ICUFoldingFilterFactory to the “text_ar” field type fixed case #1 for me by stripping the hamza from the waw.  You can read more about this filter in the Solr Reference Guide (yes, this is basically for Solr 6.4, but I don’t think this functionality has changed between 4.9 and 6.4): <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ICUFoldingFilter>.  If you do this, you can remove the LowerCaseFilterFactory since ICUFoldingFilterFactory performs lowercasing as part of its work.
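
The mismatch described above can be reproduced outside Solr. The following is a minimal Python sketch, not part of the original message and not how ICUFoldingFilterFactory is implemented: it uses only the standard library's unicodedata module to show that decomposing the waw-with-hamza (U+0624) and dropping the combining hamza (U+0654) makes the analyzed indexed term equal to the analyzed query term. The helper name strip_combining_marks is hypothetical, and ICU folding does considerably more than this one step.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose text (NFD) and drop combining marks such as hamza above (U+0654).

    Illustration only: this approximates a single aspect of ICU folding.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Analyzed indexed term: meem, waw with hamza above (U+0624), seen, seen
indexed_term = "\u0645\u0624\u0633\u0633"
# Analyzed query term: meem, plain waw (U+0648), seen, seen
query_term = "\u0645\u0648\u0633\u0633"

print(indexed_term == query_term)                         # False
print(strip_combining_marks(indexed_term) == query_term)  # True
```

Because the same folding is applied at index time and at query time, both forms end up as the same term, which is why the query matches after the filter is added and the data is re-indexed.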

Note that to use ICUFoldingFilterFactory you must add three jars to the lib/ directory in your solr home dir.  Here’s how I did it:

$ mkdir example/solr/lib
$ cp dist/solr-analysis-extras-4.9.0.jar example/solr/lib/
$ cp contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.9.0.jar example/solr/lib/
$ cp contrib/analysis-extras/lib/icu4j-53.1.jar example/solr/lib/

--
Steve
www.lucidworks.com 

> On Feb 1, 2017, at 6:50 AM, mohanmca01 <mo...@gmail.com> wrote:
> 
> Dear Steve,
> 
> Thanks for investigating our problem. Our project is basically a business
> directory search platform, and we have details for more than 100K
> businesses. I am providing some examples of Arabic words to reproduce the
> problem. Please find attached a Word file in which I have explained
> everything, along with screenshots: arabicSearch.docx
> <http://lucene.472066.n3.nabble.com/file/n4318227/arabicSearch.docx>
> 
> Regarding upgrading to the latest version: our project is running on Java
> 1.7, and if I need to upgrade then we have to upgrade Java, the JBoss
> application server, and so on. This is not the right time for that activity.