Posted to solr-user@lucene.apache.org by Olala <ht...@gmail.com> on 2010/02/03 02:53:54 UTC

Search without diacritics

Hi all!

I have a problem with Solr, and I hope everybody here can help me :)

I want to search using text without diacritics and have Solr return both
text with diacritics and text without them.

For example, when I query "solr index", it should return "solr index", "sôlr
index", "sòlr index", "sólr indèx", ...

I have tried ASCIIFoldingFilter and ISOLatin1AccentFilterFactory, but the
results are not correct :(

My schema config:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>


-- 
View this message in context: http://old.nabble.com/Search-wihthout-diacritics-tp27430345p27430345.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search without diacritics

Posted by Grant Ingersoll <gs...@apache.org>.
On Feb 2, 2010, at 8:53 PM, Olala wrote:

> 
> Hi all!
> 
> I have a problem with Solr, and I hope everybody here can help me :)
> 
> I want to search using text without diacritics and have Solr return both
> text with diacritics and text without them.
> 
> For example, when I query "solr index", it should return "solr index", "sôlr
> index", "sòlr index", "sólr indèx", ...
> 
> I have tried ASCIIFoldingFilter and ISOLatin1AccentFilterFactory, but the
> results are not correct :(

What's not correct?  Can you provide more detail?  Is it not indexed correctly?  You might look at the Analysis tool under the Solr admin area to see how it is processing your content during indexing and searching.

> 
> My schema config:
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0" catenateWords="0"
>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0" catenateWords="0"
>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>   </analyzer>
> </fieldType>

You probably should strip diacritics during query time, too.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Search without diacritics

Posted by Lance Norskog <go...@gmail.com>.
You need to add ASCIIFoldingFilterFactory to the query analyzer as well as
the index analyzer. In your schema it is only in the index analyzer.

The solr/admin/analysis.jsp page lets you explore how these analysis
stacks work.
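
As a rough sketch of that suggestion, the query analyzer from the original
post could be changed as shown here. The only new line is the
ASCIIFoldingFilterFactory filter. Placing it right after
WordDelimiterFilterFactory, to mirror the index analyzer, is an assumption
and not something stated in this thread.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  <!-- Added: fold accented characters to ASCII at query time, in the same
       position as in the index analyzer (assumed placement). -->
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
</analyzer>
```

With folding applied on both sides, queries for "sólr indèx" and
"solr index" should reduce to the same terms. The analysis page is one way
to confirm that for your own data.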


-- 
Lance Norskog
goksron@gmail.com