Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/09/04 01:14:25 UTC

Re: Searching with or without diacritics

Take a look at the MappingCharFilterFactory (in Solr 1.4) and/or the 
ISOLatin1AccentFilterFactory.
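
For the Solr 1.3 option, a minimal sketch of a field type that uses ISOLatin1AccentFilterFactory might look like the following. The field type name "text_noaccent" is an assumption made for illustration only. Note that this is a token filter, so it goes after the tokenizer, and that it targets ISO Latin-1 letters (such as ä and ô); letters from Latin Extended-A such as š, č and ť are probably left untouched by it.

```xml
<!-- Sketch only: the name "text_noaccent" is hypothetical. -->
<fieldtype class="solr.TextField" name="text_noaccent" positionIncrementGap="100">
  <analyzer>
    <!-- The tokenizer comes first and splits the text on whitespace. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Token filter: replaces ISO Latin-1 accented letters with unaccented ones. -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```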

: Date: Thu, 27 Aug 2009 16:30:08 +0200
: From: "[ISO-8859-1] György Frivolt" <gy...@gmail.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user <so...@lucene.apache.org>
: Subject: Searching with or without diacritics
: 
: Hello,
: 
:      I started using Solr only recently, through the Ruby/Rails sunspot-solr
: client. I use Solr on a Slovak/Czech data set and noticed one unwanted
: behaviour of the search. When users search for an expression or word that
: contains diacritics (letters like š, č, ť, ä, ô, ...), they usually omit the
: special characters in the search query. In this case Solr does not
: return the records containing the expression the user intended to
: find.
:      How can I configure Solr so that it finds records containing
: special characters even if the query is written without the accents?
: 
:      Some info about my Solr instance:
:      Solr Specification Version: 1.3.0
:      Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
:      Lucene Specification Version: 2.4-dev
:      Lucene Implementation Version: 2.4-dev 691741 - 2008-09-03 15:25:16
: 
: Thanks for your help, regards,
: 
:      Georg
: 



-Hoss

Re: Searching with or without diacritics

Posted by AHMET ARSLAN <io...@yahoo.com>.
> Hi,
> 
> Thanks for the suggestions, perhaps I am closer to the goal, but I
> still don't get the result. I would like to find accented characters
> (mapped by the MappingCharFilterFactory) by writing unaccented
> queries. On this page:
> 
> http://issues.ez.no/IssueView.php?Id=14742&activeItem=2
> 
> I've found that the MappingCharFilter should be added to both the
> index and query types of analyzers. This is the first I have heard
> of these two types. Is this the issue? So far none of my analyzers
> has been marked with type "index" or "query".

Since it is not marked with type "index" or "query", it is used for both.
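
In other words, a single untyped analyzer should behave roughly like the two typed analyzers in this sketch. The tokenizer and filter here are simply copied from the schema posted earlier in the thread; depending on the build, a CharStream-aware tokenizer may be needed instead, so treat the class names as assumptions to check against your version.

```xml
<fieldtype class="solr.TextField" name="text" positionIncrementGap="100">
  <!-- Applied to field values when documents are indexed. -->
  <analyzer type="index">
    <!-- Char filters run on the raw text, before the tokenizer. -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Token filters run on the tokens, after the tokenizer. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the query string at search time. -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Splitting the analyzer this way is only worthwhile when index-time and query-time analysis need to differ. For accent folding both sides should apply the same mapping, so the single untyped analyzer is enough.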

Can you try this fieldType and give feedback: 

<fieldtype class='solr.TextField' name='text' positionIncrementGap='100'>
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
      mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
    <filter class='solr.LowerCaseFilterFactory' />
  </analyzer>
</fieldtype>

Just to make sure: 

You are using latest nightly build of solr, right?

Does the mapping-ISOLatin1Accent.txt file (under the conf directory) contain the character mappings that you want to replace?
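
As a rough illustration of what to look for: the mapping file is plain text with one "source" => "target" rule per line, where lines starting with # are comments. The sketch below shows, inside an XML comment next to the charFilter that loads the file, the kind of entries that Slovak/Czech text would need. These entries are an assumption about what may be missing; the stock file may not cover š, č or ť, since they lie outside ISO Latin-1.

```xml
<!--
  Example entries for conf/mapping-ISOLatin1Accent.txt (illustrative only):

  # š => s
  "\u0161" => "s"
  # č => c
  "\u010D" => "c"
  # ť => t
  "\u0165" => "t"
  # ä => a
  "\u00E4" => "a"
  # ô => o
  "\u00F4" => "o"
-->
<charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-ISOLatin1Accent.txt"/>
```

Uppercase variants (for example "\u0160" => "S") need their own rules, because the char filter runs before LowerCaseFilterFactory.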

Just FYI, StandardFilter is meaningless without StandardTokenizer, so I removed it from your field type.

Hope this helps.


      

Re: Searching with or without diacritics

Posted by György Frivolt <fi...@gmail.com>.
Hi,

Thanks for the suggestions, perhaps I am closer to the goal, but I still
don't get the result. I would like to find accented characters (mapped by
the MappingCharFilterFactory) by writing unaccented queries. On this page:

http://issues.ez.no/IssueView.php?Id=14742&activeItem=2

I've found that the MappingCharFilter should be added to both the index and
query types of analyzers. This is the first I have heard of these two types.
Is this the issue? So far none of my analyzers has been marked with type
"index" or "query".

Now I use these schema.xml snippets.

    <fieldtype class='solr.TextField' name='text'
positionIncrementGap='100'>
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory"
          mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
      </analyzer>
    </fieldtype>

and

    <field indexed='true' multiValued='true' name='text' stored='false'
type='text' />

Please excuse my limited knowledge of Solr. I am very curious about cracking
this nut. Thanks!

Georg


On Thu, Sep 17, 2009 at 7:26 PM, AHMET ARSLAN <io...@yahoo.com> wrote:

> > The sequence of the TokenizerChain is
> > not correct... Filters must be after tokenizer:
>
> Correct for TokenFilter(s), wrong for charFilter(s).
>
> MappingCharFilterFactory comes before tokenizer.

Re: Searching with or without diacritics

Posted by AHMET ARSLAN <io...@yahoo.com>.
> The sequence of the TokenizerChain is
> not correct... Filters must be after tokenizer:

Correct for TokenFilter(s), wrong for charFilter(s).

MappingCharFilterFactory comes before tokenizer.





      

Re: Searching with or without diacritics

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
The sequence of the TokenizerChain is not correct... Filters must be 
after tokenizer:

      <analyzer>
        <!-- <tokenizer class='solr.StandardTokenizerFactory' /> -->
        <charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
      </analyzer>


Koji




Re: Searching with or without diacritics

Posted by György Frivolt <fi...@gmail.com>.
I tried to use ISOLatin1AccentFilterFactory under Solr 1.3. It partly
works, but does not recognize most of the characters I need to map. So I
tried MappingCharFilterFactory instead. According to the documentation it
needs a different tokenizer, which I set, and a mapping file, which is a
simple text file of character mappings. This would be fine for me, but when
I tried it, it did nothing. I suspect that it cannot locate the mapping file.

The mapping-ISOLatin1Accent.txt is placed in my conf directory. I tried
changing the path in the schema, but nothing happens. How can I tell Solr to
read this mapping file?

This is my schema.xml:

<?xml version='1.0' encoding='utf-8' ?>
<schema name='sunspot' version='0.9'>
  <types>
    <fieldtype class='solr.TextField' name='text'
positionIncrementGap='100'>
      <analyzer>
        <!-- <tokenizer class='solr.StandardTokenizerFactory' /> -->
        <filter class='solr.StandardFilterFactory' />
        <filter class='solr.LowerCaseFilterFactory' />
        <charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype class='solr.RandomSortField' name='rand'></fieldtype>
    <fieldtype class='solr.BoolField' name='boolean' omitNorms='true' />
    <fieldtype class='solr.SortableFloatField' name='sfloat'
omitNorms='true' />
    <fieldtype class='solr.DateField' name='date' omitNorms='true' />
    <fieldtype class='solr.SortableIntField' name='sint' omitNorms='true' />
    <fieldtype class='solr.StrField' name='string' omitNorms='true' />
  </types>
  <fields>
    <field indexed='true' multiValued='false' name='id' stored='true'
type='string' />
    <field indexed='true' multiValued='true' name='type' stored='false'
type='string' />
    <field indexed='true' multiValued='false' name='class_name'
stored='false' type='string' />
    <field indexed='true' multiValued='true' name='text' stored='false'
type='text' />
    <dynamicField indexed='true' multiValued='true' name='*_text'
stored='false' type='text' />
    <dynamicField indexed='true' name='random_*' stored='false' type='rand'
/>
    <dynamicField indexed='true' multiValued='false' name='*_b'
stored='false' type='boolean' />
    <dynamicField indexed='true' multiValued='false' name='*_f'
stored='false' type='sfloat' />
    <dynamicField indexed='true' multiValued='false' name='*_d'
stored='false' type='date' />
    <dynamicField indexed='true' multiValued='false' name='*_i'
stored='false' type='sint' />
    <dynamicField indexed='true' multiValued='false' name='*_s'
stored='false' type='string' />
    <dynamicField indexed='true' multiValued='true' name='*_bm'
stored='false' type='boolean' />
    <dynamicField indexed='true' multiValued='true' name='*_fm'
stored='false' type='sfloat' />
    <dynamicField indexed='true' multiValued='true' name='*_dm'
stored='false' type='date' />
    <dynamicField indexed='true' multiValued='true' name='*_im'
stored='false' type='sint' />
    <dynamicField indexed='true' multiValued='true' name='*_sm'
stored='false' type='string' />
    <dynamicField indexed='true' multiValued='false' name='*_bs'
stored='true' type='boolean' />
    <dynamicField indexed='true' multiValued='false' name='*_fs'
stored='true' type='sfloat' />
    <dynamicField indexed='true' multiValued='false' name='*_ds'
stored='true' type='date' />
    <dynamicField indexed='true' multiValued='false' name='*_is'
stored='true' type='sint' />
    <dynamicField indexed='true' multiValued='false' name='*_ss'
stored='true' type='string' />
    <dynamicField indexed='true' multiValued='true' name='*_bms'
stored='true' type='boolean' />
    <dynamicField indexed='true' multiValued='true' name='*_fms'
stored='true' type='sfloat' />
    <dynamicField indexed='true' multiValued='true' name='*_dms'
stored='true' type='date' />
    <dynamicField indexed='true' multiValued='true' name='*_ims'
stored='true' type='sint' />
    <dynamicField indexed='true' multiValued='true' name='*_sms'
stored='true' type='string' />
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>text</defaultSearchField>
  <solrQueryParser defaultOperator='AND' />
  <copyField dest='text' source='*_text' />
</schema>
