Posted to solr-user@lucene.apache.org by György Frivolt <gy...@gmail.com> on 2009/08/27 16:30:08 UTC

Searching with or without diacritics

Hello,

     I started using Solr only recently, through the Ruby/Rails sunspot-solr
client. I use Solr on a Slovak/Czech data set and ran into one unwanted
behaviour of the search. When a user searches for an expression or word that
contains diacritics (letters like š, č, ť, ä, ô, ...), the special characters
are usually omitted from the search query. In that case Solr does not return
the records containing the expression the user intended to find.
     How can I configure Solr so that it finds records containing accented
characters even when the query is written without the accents?

     Some info about my Solr instance:

     Solr Specification Version: 1.3.0
     Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
     Lucene Specification Version: 2.4-dev
     Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16

Thanks for your help. Regards,

     Georg

Re: Searching with or without diacritics

Posted by AHMET ARSLAN <io...@yahoo.com>.
> Hi,
>
> Thanks for the suggestions. Perhaps I am closer to the goal, but I still
> don't get the result. I would like to find accented characters (mapped by
> the MappingCharFilterFactory) by writing unaccented queries. On this page:
>
> http://issues.ez.no/IssueView.php?Id=14742&activeItem=2
>
> I found that the MappingCharFilter should be added to both the index and
> the query type of analyzer. This is the first I have heard of these two
> types. Is this the issue? So far none of my analyzers has been marked with
> type "index" or "query".

Since it is marked with neither type "index" nor type "query", it is used for both.

Can you try this fieldType and give feedback: 

<fieldtype class="solr.TextField" name="text" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
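
For comparison, here is a sketch of the same chain written with explicit index and query analyzers. It should behave like the single untyped analyzer, since an analyzer without a type is used for both. The field type name text_folded is made up for illustration; the factory classes are the ones already quoted in this thread, and I am assuming a build where they are available.

```xml
<!-- Sketch only: "text_folded" is a hypothetical name, not part of the original schema. -->
<fieldtype class="solr.TextField" name="text_folded" positionIncrementGap="100">
  <!-- Applied to field values when documents are indexed -->
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the query text at search time; same chain, so both sides are folded alike -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```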

Just to make sure: 

You are using the latest nightly build of Solr, right?

Does the mapping-ISOLatin1Accent.txt file under the conf directory contain the character mappings that you want to replace?

Just FYI, StandardFilter is meaningless without StandardTokenizer, so I removed it from your field type.

Hope this helps.


      

Re: Searching with or without diacritics

Posted by György Frivolt <fi...@gmail.com>.
Hi,

Thanks for the suggestions. Perhaps I am closer to the goal, but I still
don't get the result. I would like to find accented characters (mapped by
the MappingCharFilterFactory) by writing unaccented queries. On this page:

http://issues.ez.no/IssueView.php?Id=14742&activeItem=2

I found that the MappingCharFilter should be added to both the index and
the query type of analyzer. This is the first I have heard of these two
types. Is this the issue? So far none of my analyzers has been marked with
type "index" or "query".

Now I use these schema.xml snippets.

    <fieldtype class='solr.TextField' name='text' positionIncrementGap='100'>
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory"
          mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
      </analyzer>
    </fieldtype>

and

    <field indexed='true' multiValued='true' name='text' stored='false' type='text' />

Please excuse my limited knowledge of Solr. I am very curious about cracking
this nut. Thanks!

Georg


On Thu, Sep 17, 2009 at 7:26 PM, AHMET ARSLAN <io...@yahoo.com> wrote:

> > The sequence of the TokenizerChain is
> > not correct... Filters must be after tokenizer:
>
> Correct for TokenFilter(s), wrong for charFilter(s).
>
> MappingCharFilterFactory comes before tokenizer.

Re: Searching with or without diacritics

Posted by AHMET ARSLAN <io...@yahoo.com>.
> The sequence of the TokenizerChain is
> not correct... Filters must be after tokenizer:

Correct for TokenFilter(s), wrong for charFilter(s).

MappingCharFilterFactory comes before tokenizer.
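
To spell out the order, here is a sketch of a single analyzer with each stage commented. It reuses the class names from the snippets in this thread and is not a tested configuration:

```xml
<analyzer>
  <!-- 1. charFilter: rewrites the raw character stream before any tokens exist -->
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>
  <!-- 2. tokenizer: splits the already-mapped stream into tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- 3. filter (TokenFilter): post-processes the tokens produced by the tokenizer -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```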





      

Re: Searching with or without diacritics

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
The order of the TokenizerChain is not correct. Filters must come after
the tokenizer:

      <analyzer>
        <!-- <tokenizer class='solr.StandardTokenizerFactory' /> -->
        <charFilter class="solr.MappingCharFilterFactory"
          mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
      </analyzer>


Koji




Re: Searching with or without diacritics

Posted by György Frivolt <fi...@gmail.com>.
I tried to use ISOLatin1AccentFilterFactory under Solr 1.3. It partly
works, but it does not recognize most of the characters I need to map. So I
tried MappingCharFilterFactory instead. According to the documentation it
needs a different tokenizer, which I set, and a mapping file, which is a
simple text file of character mappings. This would be fine for me, but when
I tried it, it did nothing. I suspect that Solr cannot locate the mapping
file.

The mapping-ISOLatin1Accent.txt file is placed in my conf directory. I tried
changing the path in the schema, but nothing happens. How can I tell Solr to
read this mapping file?

This is my schema.xml:

<?xml version='1.0' encoding='utf-8' ?>
<schema name='sunspot' version='0.9'>
  <types>
    <fieldtype class='solr.TextField' name='text' positionIncrementGap='100'>
      <analyzer>
        <!-- <tokenizer class='solr.StandardTokenizerFactory' /> -->
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype class='solr.RandomSortField' name='rand'></fieldtype>
    <fieldtype class='solr.BoolField' name='boolean' omitNorms='true' />
    <fieldtype class='solr.SortableFloatField' name='sfloat' omitNorms='true' />
    <fieldtype class='solr.DateField' name='date' omitNorms='true' />
    <fieldtype class='solr.SortableIntField' name='sint' omitNorms='true' />
    <fieldtype class='solr.StrField' name='string' omitNorms='true' />
  </types>
  <fields>
    <field indexed='true' multiValued='false' name='id' stored='true' type='string' />
    <field indexed='true' multiValued='true' name='type' stored='false' type='string' />
    <field indexed='true' multiValued='false' name='class_name' stored='false' type='string' />
    <field indexed='true' multiValued='true' name='text' stored='false' type='text' />
    <dynamicField indexed='true' multiValued='true' name='*_text' stored='false' type='text' />
    <dynamicField indexed='true' name='random_*' stored='false' type='rand' />
    <dynamicField indexed='true' multiValued='false' name='*_b' stored='false' type='boolean' />
    <dynamicField indexed='true' multiValued='false' name='*_f' stored='false' type='sfloat' />
    <dynamicField indexed='true' multiValued='false' name='*_d' stored='false' type='date' />
    <dynamicField indexed='true' multiValued='false' name='*_i' stored='false' type='sint' />
    <dynamicField indexed='true' multiValued='false' name='*_s' stored='false' type='string' />
    <dynamicField indexed='true' multiValued='true' name='*_bm' stored='false' type='boolean' />
    <dynamicField indexed='true' multiValued='true' name='*_fm' stored='false' type='sfloat' />
    <dynamicField indexed='true' multiValued='true' name='*_dm' stored='false' type='date' />
    <dynamicField indexed='true' multiValued='true' name='*_im' stored='false' type='sint' />
    <dynamicField indexed='true' multiValued='true' name='*_sm' stored='false' type='string' />
    <dynamicField indexed='true' multiValued='false' name='*_bs' stored='true' type='boolean' />
    <dynamicField indexed='true' multiValued='false' name='*_fs' stored='true' type='sfloat' />
    <dynamicField indexed='true' multiValued='false' name='*_ds' stored='true' type='date' />
    <dynamicField indexed='true' multiValued='false' name='*_is' stored='true' type='sint' />
    <dynamicField indexed='true' multiValued='false' name='*_ss' stored='true' type='string' />
    <dynamicField indexed='true' multiValued='true' name='*_bms' stored='true' type='boolean' />
    <dynamicField indexed='true' multiValued='true' name='*_fms' stored='true' type='sfloat' />
    <dynamicField indexed='true' multiValued='true' name='*_dms' stored='true' type='date' />
    <dynamicField indexed='true' multiValued='true' name='*_ims' stored='true' type='sint' />
    <dynamicField indexed='true' multiValued='true' name='*_sms' stored='true' type='string' />
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>text</defaultSearchField>
  <solrQueryParser defaultOperator='AND' />
  <copyField dest='text' source='*_text' />
</schema>

On Fri, Sep 4, 2009 at 1:14 AM, Chris Hostetter <ho...@fucit.org>wrote:

>
> Take a look at the MappingCharFilterFactory (in Solr 1.4) and/or the
> ISOLatin1AccentFilterFactory.
>
> -Hoss
>

Re: Searching with or without diacritics

Posted by Chris Hostetter <ho...@fucit.org>.
Take a look at the MappingCharFilterFactory (in Solr 1.4) and/or the 
ISOLatin1AccentFilterFactory.




-Hoss