Posted to solr-user@lucene.apache.org by Avlesh Singh <av...@gmail.com> on 2009/11/01 03:45:08 UTC

Re: Iso accents and wildcards

>
> When I request with title:econ* I get the correct answers, but if I
> request with title:écon* I get no answers.
> If I request with title:économ (the exact word in the index) it works, so
> there might be something wrong with the wildcard.
> As far as I understand, the analyzer should be applied in exactly the same
> way at both index and query time.
>
Wildcard queries are not analyzed, hence the "inconsistent" behaviour: the
prefix écon is compared with the indexed terms as typed, while the index
holds the accent-stripped "econom". The easiest way out is to define one
more field, "title_orginal", as an untokenized field. While querying, you can
use both fields at the same time, e.g. q=(title:écon* title_orginal:écon*).
Either way, you would get the desired matches.
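
As a rough sketch of the extra field (not a tested configuration), the
schema.xml entries could look something like the following. It assumes the
stock "string" field type (solr.StrField) from the example schema is
available, and that a copyField from title is an acceptable way to populate
title_orginal; neither detail is stated above.

```xml
<!-- Assumption: "string" is the untokenized solr.StrField type from the
     example schema. Goes inside the <fields> section of schema.xml. -->
<field name="title_orginal" type="string" indexed="true" stored="false"/>

<!-- Assumption: fill the new field from title at index time.
     Goes next to the other copyField declarations. -->
<copyField source="title" dest="title_orginal"/>
```

Because a string field indexes the whole title as a single term,
title_orginal:écon* should only match titles that begin with "écon" exactly
as typed, including case.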

Cheers
Avlesh

On Fri, Oct 30, 2009 at 9:19 PM, Nicolas Leconte <ni...@aidel.com> wrote:

> Hi all,
>
> I have a field that contains accented characters, and I want to be able
> to search it while ignoring accents.
> I have set up that field with:
> <analyzer>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StandardFilterFactory"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1" />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> <filter class="solr.SnowballPorterFilterFactory" language="French"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.ISOLatin1AccentFilterFactory"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
>
> In the index the word "économie" becomes "econom": the accent is
> removed by the ISOLatin1AccentFilterFactory and the end of the word is
> removed by the SnowballPorterFilterFactory.
>
> When I request with title:econ* I get the correct answers, but if I
> request with title:écon* I get no answers.
> If I request with title:économ (the exact word in the index) it works, so
> there might be something wrong with the wildcard.
> As far as I understand, the analyzer should be applied in exactly the same
> way at both index and query time.
>
> I have tried changing the order of the filters (putting the
> ISOLatin1AccentFilterFactory first) without any result.
>
> Could anybody help me with that and point out what may be wrong with my
> schema?
>

Re: Iso accents and wildcards

Posted by Nicolas Leconte <ni...@aidel.com>.
Thanks for the tips, I will try to do exactly what you suggest.
