Posted to solr-user@lucene.apache.org by Valentina Cavazza <va...@step-net.it> on 2016/07/06 14:04:27 UTC
help: Solr greek insensitive regex phrase query search
We created a new field type for sentences that contain text in Latin and
ancient Greek. The text can include Greek words with accents.
We want accent-insensitive search. For example:
if I search for the word βιβλος, I want to find the word βίβλος in the text,
where the iota carries a tonos (U+03AF).
Similarly, if I search for the word βίβλος, where the iota carries an oxia,
the acute accent (U+1F77), I again want to find the word βίβλος with the
tonos iota (U+03AF) in the text.
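As a rough illustration of the matching behaviour we are after (this is not how Solr or ICU implement folding), here is a minimal Python sketch. It assumes that canonical decomposition (NFD) followed by removal of combining marks is an acceptable stand-in for accent folding; fold_accents is a hypothetical helper name.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose to NFD, drop combining marks (category Mn), then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


# The three spellings from the message, written as escapes to keep the
# code points unambiguous.
plain = "\u03b2\u03b9\u03b2\u03bb\u03bf\u03c2"  # unaccented iota (U+03B9)
tonos = "\u03b2\u03af\u03b2\u03bb\u03bf\u03c2"  # iota with tonos (U+03AF)
oxia = "\u03b2\u1f77\u03b2\u03bb\u03bf\u03c2"   # iota with oxia (U+1F77)

# The accented spellings differ as raw strings but fold to the same form.
assert tonos != oxia
assert fold_accents(tonos) == fold_accents(oxia) == plain
```

The same folding has to be applied to both the indexed text and the query for the three spellings to match.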
I looked for solutions and found the filter ASCIIFoldingFilterFactory.
I installed that filter, but it does not do the job correctly for Greek:
<fieldType name="text_acs" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>
If we use the ICUFoldingFilterFactory filter instead, single-word search works
well, but the regex queries and phrase queries that we used before installing
ICUFoldingFilterFactory no longer work.
<fieldType name="text_acs" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>
The text field contains words like this: <w ana='#n'
xml:lang='grc-Grek'>βίβλος</w>
If I search for the word βιβλος, I find the word βίβλος with the tonos iota
(U+03AF) in the text: OK.
If I search for the word βίβλος with the oxia iota (U+1F77), I again find the
word βίβλος with the tonos iota (U+03AF) in the text: OK.
I also need users to be able to search for the word together with its
container tag w: <w ana='#n'></w>
Re: help: Solr greek insensitive regex phrase query search
Posted by Valentina Cavazza <va...@step-net.it>.
Thanks for the answer.
On the analysis page I see that Solr ignores tag symbols such as <>='# and
treats the tag and attribute names as words (I use StandardTokenizerFactory).
So it does not matter if I only have to search in a field like this: <w ana='#n'
xml:lang='grc-Grek'>βίβλος</w>
I can use a query like this: "w ana n βιβλος"~3
But what if I want this word in the tag <w> inside another tag <foreign>?
I would like to run queries like these:
<w ana='#n' *>word</w>
<w ana='#adj' *>word</w>
<foreign ana='cdswInter'>*<w ana='#n' *>word</w>*</foreign>
In the last case it is important to match the closing </foreign>.
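To make the intended matches concrete, here is a minimal Python sketch that works outside Solr, directly on the markup. It assumes the field content is well-formed XML and that stripping combining marks is an acceptable approximation of accent folding; fold, find_words and the sample sentence are hypothetical, made up for illustration only.

```python
import unicodedata
import xml.etree.ElementTree as ET


def fold(text):
    """Accent- and case-insensitive form: NFD, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    ).lower()


def find_words(xml_text, word, ana, inside=None):
    """Return <w> elements with the given ana whose folded text equals the
    folded word. `inside` is an optional (tag, ana) pair naming an element
    that must enclose the <w>."""
    root = ET.fromstring(xml_text)
    if inside is None:
        scopes = [root]
    else:
        tag, scope_ana = inside
        scopes = [e for e in root.iter(tag) if e.get("ana") == scope_ana]
    target = fold(word)
    hits = []
    for scope in scopes:
        for w in scope.iter("w"):
            # `w not in hits` avoids duplicates when scopes are nested.
            if (w.get("ana") == ana and fold(w.text or "") == target
                    and w not in hits):
                hits.append(w)
    return hits


# Hypothetical sample: one noun inside <foreign>, one adjective outside it.
sample = (
    "<s><foreign ana='cdswInter'>in "
    "<w ana='#n' xml:lang='grc-Grek'>\u03b2\u03af\u03b2\u03bb\u03bf\u03c2</w>"
    " scriptum</foreign> "
    "<w ana='#adj'>\u03b2\u03af\u03b2\u03bb\u03bf\u03c2</w></s>"
)
query = "\u03b2\u03b9\u03b2\u03bb\u03bf\u03c2"  # unaccented search term

nouns_in_foreign = find_words(sample, query, "#n",
                              inside=("foreign", "cdswInter"))
```

Because the parser pairs every opening tag with its close, the requirement to match the final </foreign> is satisfied by construction here, which is what the token-based phrase query cannot guarantee.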
I cannot find anything useful in the Solr documentation for this kind of
tag search.
Best,
Valentina
On 06/07/2016 17:27, Erick Erickson wrote:
> What do you see if you use the admin/analysis page? That should give
> you a clue what's happening here....
>
> Best,
> Erick
Re: help: Solr greek insensitive regex phrase query search
Posted by Erick Erickson <er...@gmail.com>.
What do you see if you use the admin/analysis page? That should give
you a clue what's happening here....
Best,
Erick