Posted to solr-user@lucene.apache.org by Valentina Cavazza <va...@step-net.it> on 2016/07/06 14:04:27 UTC

help: Solr greek insensitive regex phrase query search

We created a new field type for sentences that contain text in Latin and
ancient Greek. The text can include Greek words with accents, and we want
searches to be accent-insensitive. For example:
if I search for the word βιβλος, I want to find in the text the word βίβλος
with the iota coronis accent (U+03AF, iota with tonos).
Similarly, if I search for the word βίβλος with the iota acute accent
(U+1F77, iota with oxia), I again want to find in the text the word βίβλος
with the iota coronis accent.
I looked for solutions and found the ASCIIFoldingFilterFactory filter.
I configured it, but it does not do the right job for Greek:
<fieldType name="text_acs" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>
If we use the ICUFoldingFilterFactory filter instead, single-word search
works well, but the regex queries and phrase queries that we used before
switching to ICUFoldingFilterFactory no longer work.
<fieldType name="text_acs" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>
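
To make the folding requirement concrete, here is a minimal sketch in Python (standard library only) of accent-insensitive comparison. It is not Solr's implementation; the helper name fold_accents is hypothetical, and it only approximates what a Unicode-aware folding filter does by decomposing each character and dropping the combining marks. It shows that the two accented spellings are different strings until they are folded.

```python
import unicodedata

def fold_accents(text):
    """Decompose to NFD, drop combining marks (category Mn), then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.lower()

# Escapes are used so the exact code points are unambiguous.
plain = "\u03b2\u03b9\u03b2\u03bb\u03bf\u03c2"  # no accent
tonos = "\u03b2\u03af\u03b2\u03bb\u03bf\u03c2"  # iota with tonos (U+03AF)
oxia = "\u03b2\u1f77\u03b2\u03bb\u03bf\u03c2"   # iota with oxia (U+1F77)

print(tonos == oxia)                                       # False
print(fold_accents(tonos) == fold_accents(oxia) == plain)  # True
```

Because the indexed terms are stored in folded form, anything that is compared against them without going through the same folding (for example a hand-written pattern containing accented letters) would be comparing unlike strings.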
The text field contains the word like this: <w ana='#n'
xml:lang='grc-Grek'>βίβλος</w>
If I search for the word βιβλος, I find in the text the word βίβλος
with the iota coronis accent. OK.
If I search for the word βίβλος with the iota acute accent, I again find in
the text the word βίβλος with the iota coronis accent. OK.
I also need users to be able to search for the word together with its
container tag w: <w ana='#n'></w>



Re: help: Solr greek insensitive regex phrase query search

Posted by Valentina Cavazza <va...@step-net.it>.
Thanks for the answer.
On the analysis page I see that Solr ignores the markup symbols such as
<>='# and treats what is left of the tags as ordinary words (I use
StandardTokenizerFactory).
So this is fine as long as I only have to search within a single element
such as: <w ana='#n' xml:lang='grc-Grek'>βίβλος</w>
I can use a query like this: "w ana n βιβλος"~3
But what if I want this word in a <w> tag that sits inside another tag,
<foreign>? I would like to run queries like these:
<w ana='#n' *>word</w>
<w ana='#adj' *>word</w>
<foreign ana='cdswInter'>*<w ana='#n' *>word</w>*</foreign>: in this
case it is important to match the closing </foreign>
I have not found anything useful in the Solr documentation for this kind
of tag search.
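
To pin down what those three patterns are meant to match, here is a small sketch in Python using only the standard library. It is not a Solr query and not a Solr feature; it only states the intended semantics on a parsed copy of the markup. The names fold, find_words and SAMPLE, and the wrapping <s> element, are made up for the illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET

def fold(text):
    """Accent- and case-insensitive form: NFD, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()

def find_words(xml_text, word, ana, inside=None):
    """Return <w> elements with the given ana whose folded text equals fold(word).

    inside: optional (tag, ana) pair; when given, only <w> elements nested in
    such an element (up to its closing tag) are considered.
    """
    root = ET.fromstring(xml_text)
    if inside is None:
        scopes = [root]
    else:
        tag, scope_ana = inside
        scopes = [e for e in root.iter(tag) if e.get("ana") == scope_ana]
    target = fold(word)
    return [w for scope in scopes for w in scope.iter("w")
            if w.get("ana") == ana and fold(w.text or "") == target]

# Hypothetical sample: one noun inside <foreign>, one adjective outside it.
SAMPLE = (
    "<s>"
    "<foreign ana='cdswInter'>"
    "<w ana='#n' xml:lang='grc-Grek'>\u03b2\u03af\u03b2\u03bb\u03bf\u03c2</w>"
    "</foreign>"
    "<w ana='#adj' xml:lang='grc-Grek'>\u03b2\u03af\u03b2\u03bb\u03bf\u03c2</w>"
    "</s>"
)

QUERY = "\u03b2\u03b9\u03b2\u03bb\u03bf\u03c2"  # unaccented search term
print(len(find_words(SAMPLE, QUERY, "#n")))                                     # 1
print(len(find_words(SAMPLE, QUERY, "#adj")))                                   # 1
print(len(find_words(SAMPLE, QUERY, "#adj", inside=("foreign", "cdswInter"))))  # 0
```

The third call is the case that a proximity phrase over flattened tokens cannot express directly: the adjective is near the word "foreign" in the token stream but is not enclosed by that element.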

Best,
Valentina

On 06/07/2016 17:27, Erick Erickson wrote:
> What do you see if you use the admin/analysis page? That should give
> you a clue what's happening here....
>
> Best,
> Erick


-- 

Valentina Cavazza
*STEP srl*
Tel. 011.98.66.277 / 0121.37.47.27
Fax. 011.98.66.728
E-mail. valentina@step-net.it
Web. www.step-net.it <http://www.step-net.it>


Re: help: Solr greek insensitive regex phrase query search

Posted by Erick Erickson <er...@gmail.com>.
What do you see if you use the admin/analysis page? That should give
you a clue what's happening here....

Best,
Erick
