Posted to solr-dev@lucene.apache.org by Manuel Albela Miranda <al...@3.14financial.com> on 2007/02/01 12:37:18 UTC

Searching with accents

Hello everybody,

Do you know if there is a way to search with and without accents without
duplicating a field?

I have a large index (60 GB) and don't want to have two fields with the
same content, one with accents and the other without them, because
this field is the biggest in the index.

Again, hope you can help me.

Thank you very much.

Regards.

Manu


Re: Searching with accents

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Manuel Albela Miranda <al...@3.14financial.com> wrote:
> I've never indexed with Solr, so the only way to get what I want is to
> re-index using Solr with the following lines:
>
> <fieldtype name="stringSimilar" class="solr.TextField"
>     positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>       </analyzer>
> </fieldtype>

The key is to put the accent filter in your fieldtype definition
somewhere, though you may not want exactly this fieldtype definition.
For example, you might want to do some stemming, or removal of
stopwords, etc.

Second, define the field in the schema to use the new fieldtype you
defined (or just change your existing fieldtype).

-Yonik

Re: Searching with accents

Posted by Manuel Albela Miranda <al...@3.14financial.com>.
Yonik Seeley wrote:
> On 2/1/07, Manuel Albela Miranda <al...@3.14financial.com> wrote:
>> Yes, I was considering that, but there is a problem. If I remove the
>> accents from the index, the results of a search will not have those
>> accents, so the results will not be good enough.
>
> Stored fields aren't altered, so you will still get the accents back.
> Just use the accent filter, re-index your collection, and then
> everything should be OK.
>
> -Yonik
>
>
> .
>
Hi Yonik,


I've never indexed with Solr, so the only way to get what I want is to
re-index using Solr with the following lines:

<fieldtype name="stringSimilar" class="solr.TextField"
    positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
</fieldtype>

Hope this works. Thank you!

Regards

Manu.


Re: Searching with accents

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-02-01 at 12:20 -0500, Yonik Seeley wrote:
> On 2/1/07, Manuel Albela Miranda <al...@3.14financial.com> wrote:
> > Yes, I was considering that, but there is a problem. If I remove the
> > accents from the index, the results of a search will not have those
> > accents, so the results will not be good enough.
> 
> Stored fields aren't altered, so you will still get the accents back.
> Just use the accent filter, re-index your collection, and then
> everything should be OK.

I agree with Yonik, from experience. In my current project I am working
for the Junta de Andalucia in Spain, so I have lots of accents. I index
with the schema I sent you, and the content is stored as is (no
alterations in writing).

So reindexing would keep your markup (if it is utf-8) and content as is.

salu2
-- 
Thorsten Scherler                       thorsten.at.apache.org
Open Source Java & XML      consulting, training and solutions


Re: Searching with accents

Posted by Yonik Seeley <yo...@apache.org>.
On 2/1/07, Manuel Albela Miranda <al...@3.14financial.com> wrote:
> Yes, I was considering that, but there is a problem. If I remove the
> accents from the index, the results of a search will not have those
> accents, so the results will not be good enough.

Stored fields aren't altered, so you will still get the accents back.
Just use the accent filter, re-index your collection, and then
everything should be OK.
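
To make the stored-versus-indexed distinction concrete, here is a toy
sketch in Python (not Solr code). It assumes a simple in-memory index;
the names `fold`, `ToyIndex`, `add` and `search` are made up for
illustration. Only the terms used for matching are folded, while the
stored text is kept exactly as it was submitted.

```python
import unicodedata
from collections import defaultdict


def fold(term):
    """Lowercase a term and strip accents (a rough stand-in for an accent filter)."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


class ToyIndex:
    """Hypothetical miniature index: folded terms for matching, original text for display."""

    def __init__(self):
        self.stored = {}                  # doc id -> original text, never altered
        self.postings = defaultdict(set)  # folded term -> ids of docs containing it

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for word in text.split():
            self.postings[fold(word)].add(doc_id)

    def search(self, query):
        # The query goes through the same folding as the indexed terms.
        ids = self.postings.get(fold(query), set())
        return [self.stored[i] for i in sorted(ids)]


index = ToyIndex()
index.add(1, "Órden ministerial")
index.add(2, "Otro documento")

print(index.search("orden"))  # the stored text comes back with its accent
print(index.search("Órden"))  # the accented query finds the same document
```

Both queries return the original "Órden ministerial" because only the
matching terms were folded, not the stored value.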

-Yonik

Re: Searching with accents

Posted by Manuel Albela Miranda <al...@3.14financial.com>.
Thorsten Scherler wrote:
> On Thu, 2007-02-01 at 16:35 +0100, Manuel Albela Miranda wrote:
>   
>> Thorsten Scherler wrote:
>>     
>>> On Thu, 2007-02-01 at 12:37 +0100, Manuel Albela Miranda wrote:
>>>   
>>>       
>>>> Hello everybody,
>>>>
>>>> Do you know if there is a way to search with and without accents without
>>>> duplicating a field?
>>>>
>>>> I have a large index (60 GB) and don't want to have two fields with the
>>>> same content, one with accents and the other without them, because
>>>> this field is the biggest in the index.
>>>>
>>>> Again, hope you can help me.
>>>>     
>>>>         
>>> Try something like this in your schema.xml:
>>> <fieldtype name="stringSimilar" class="solr.TextField"
>>> positionIncrementGap="100">
>>>       <analyzer type="index">
>>>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>>>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>>       </analyzer>
>>>       <analyzer type="query">
>>>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>>>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>>       </analyzer>
>>>     </fieldtype>
>>>
>>> HTH
>>>
>>> salu2
>>>
>>>   
>>>       
>>>> Thank you very much.
>>>>
>>>> Regards.
>>>>
>>>> Manu
>>>>
>>>>     
>>>>         
>> Hi Thorsten,
>>
>> First of all, thank you for your message. I've been working on the
>> schema.xml file with the lines you sent me. Now I can filter the query,
>> but the problem is that I have accents in my index, so when I search for
>> words with accents, Solr only searches for the word without them, and I
>> need both. I don't know if there is a way to do this.
>>     
>
> Well, it is not nice but you could use fuzzy search.
>
> For example: q=Órden~0.75
>
> That will find more matches. See recent threads around fuzzy search.
>
> The above schema patch works nicely if you update your index (index
> everything again), but that means you would need to reindex the WHOLE 60 GB.
>
> salu2
>
>   
Yes, I was considering that, but there is a problem. If I remove the
accents from the index, the results of a search will not have those
accents, so the results will not be good enough.

I have to see how the fuzzy search performs, but I don't think it
would work for me.

Thank you again.

Regards.

Manu.

Re: Searching with accents

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-02-01 at 16:35 +0100, Manuel Albela Miranda wrote:
> Thorsten Scherler wrote:
> > On Thu, 2007-02-01 at 12:37 +0100, Manuel Albela Miranda wrote:
> >   
> >> Hello everybody,
> >>
> >> Do you know if there is a way to search with and without accents without
> >> duplicating a field?
> >>
> >> I have a large index (60 GB) and don't want to have two fields with the
> >> same content, one with accents and the other without them, because
> >> this field is the biggest in the index.
> >>
> >> Again, hope you can help me.
> >>     
> >
> > Try something like this in your schema.xml:
> > <fieldtype name="stringSimilar" class="solr.TextField"
> > positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
> >         <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
> >         <filter class="solr.ISOLatin1AccentFilterFactory"/>
> >       </analyzer>
> >     </fieldtype>
> >
> > HTH
> >
> > salu2
> >
> >   
> >> Thank you very much.
> >>
> >> Regards.
> >>
> >> Manu
> >>
> >>     
> Hi Thorsten,
> 
> First of all, thank you for your message. I've been working on the
> schema.xml file with the lines you sent me. Now I can filter the query,
> but the problem is that I have accents in my index, so when I search for
> words with accents, Solr only searches for the word without them, and I
> need both. I don't know if there is a way to do this.

Well, it is not nice but you could use fuzzy search.

For example: q=Órden~0.75

That will find more matches. See recent threads around fuzzy search.

The above schema patch works nicely if you update your index (index
everything again), but that means you would need to reindex the WHOLE 60 GB.

salu2

> 
> Regards.
> 
> Manu.
> 
-- 
Thorsten Scherler                       thorsten.at.apache.org
Open Source Java & XML      consulting, training and solutions


RE: Searching with accents

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Within Lucene the solution is to index the accented and unaccented
versions of the word at the same position (i.e. without incrementing the
position counter).  Perhaps this could be added as an option to the
ISOLatin1AccentFilter? Or perhaps it's already there?
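
As a rough picture of what "same position" means, here is a toy sketch
in Python (not Lucene code). It assumes whitespace tokenization; the
names `fold` and `tokens_with_positions` are made up for illustration.
Each word is emitted at a new position, and when stripping the accents
changes the word, the unaccented variant is emitted again at that same
position, which corresponds to a position increment of 0.

```python
import unicodedata


def fold(term):
    """Lowercase a term and strip accents via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokens_with_positions(text):
    """Return (term, position) pairs; a folded variant shares its original's position."""
    out = []
    position = -1
    for word in text.lower().split():
        position += 1                        # increment of 1 for the original token
        out.append((word, position))
        folded = fold(word)
        if folded != word:
            out.append((folded, position))   # increment of 0: same position
    return out


for term, pos in tokens_with_positions("Órden del día"):
    print(pos, term)
```

With both forms in one field, an unaccented query could match accented
and unaccented text alike, while an accented query could still match
only the accented form, provided the query side is not folded. Phrase
queries keep working because the positions are unchanged.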

Peter

-----Original Message-----
From: Manuel Albela Miranda [mailto:albela@3.14financial.com] 
Sent: Thursday, February 01, 2007 8:35 AM
To: solr-dev@lucene.apache.org
Subject: Re: Searching with accents

Thorsten Scherler wrote:
> On Thu, 2007-02-01 at 12:37 +0100, Manuel Albela Miranda wrote:
>   
>> Hello everybody,
>>
>> Do you know if there is a way to search with and without accents
>> without duplicating a field?
>>
>> I have a large index (60 GB) and don't want to have two fields with
>> the same content, one with accents and the other without them,
>> because this field is the biggest in the index.
>>
>> Again, hope you can help me.
>>     
>
> Try something like this in your schema.xml:
> <fieldtype name="stringSimilar" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>       </analyzer>
>     </fieldtype>
>
> HTH
>
> salu2
>
>   
>> Thank you very much.
>>
>> Regards.
>>
>> Manu
>>
>>     
Hi Thorsten,

First of all, thank you for your message. I've been working on the
schema.xml file with the lines you sent me. Now I can filter the query,
but the problem is that I have accents in my index, so when I search for
words with accents, Solr only searches for the word without them, and I
need both. I don't know if there is a way to do this.

Regards.

Manu.


Re: Searching with accents

Posted by Manuel Albela Miranda <al...@3.14financial.com>.
Thorsten Scherler wrote:
> On Thu, 2007-02-01 at 12:37 +0100, Manuel Albela Miranda wrote:
>   
>> Hello everybody,
>>
>> Do you know if there is a way to search with and without accents without
>> duplicating a field?
>>
>> I have a large index (60 GB) and don't want to have two fields with the
>> same content, one with accents and the other without them, because
>> this field is the biggest in the index.
>>
>> Again, hope you can help me.
>>     
>
> Try something like this in your schema.xml:
> <fieldtype name="stringSimilar" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>         <filter class="solr.ISOLatin1AccentFilterFactory"/>
>       </analyzer>
>     </fieldtype>
>
> HTH
>
> salu2
>
>   
>> Thank you very much.
>>
>> Regards.
>>
>> Manu
>>
>>     
Hi Thorsten,

First of all, thank you for your message. I've been working on the
schema.xml file with the lines you sent me. Now I can filter the query,
but the problem is that I have accents in my index, so when I search for
words with accents, Solr only searches for the word without them, and I
need both. I don't know if there is a way to do this.

Regards.

Manu.


Re: Searching with accents

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-02-01 at 12:37 +0100, Manuel Albela Miranda wrote:
> Hello everybody,
> 
> Do you know if there is a way to search with and without accents without
> duplicating a field?
>
> I have a large index (60 GB) and don't want to have two fields with the
> same content, one with accents and the other without them, because
> this field is the biggest in the index.
> 
> Again, hope you can help me.

Try something like this in your schema.xml:
<fieldtype name="stringSimilar" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldtype>
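
A rough way to picture what this analyzer chain does to both the indexed
text and the query, as a Python sketch rather than Solr code. It assumes
the tokenizer splits on non-letters and lowercases, and it approximates
the accent filter with Unicode decomposition, which is not exactly what
ISOLatin1AccentFilter does (ligatures such as ß or æ are not handled
here). The name `analyze` is made up for illustration.

```python
import re
import unicodedata


def analyze(text):
    """Approximate a lowercase letter tokenizer followed by accent stripping."""
    # Compose first so an accented letter is a single character for the regex.
    text = unicodedata.normalize("NFC", text)
    # Runs of letters only (word characters minus digits and underscore), lowercased.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    folded = []
    for token in tokens:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


print(analyze("Órden del día"))
print(analyze("orden del dia"))
```

Because the same chain runs at index time and at query time, the
accented and unaccented spellings end up as the same terms, so either
spelling of the query matches either spelling in the documents.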

HTH

salu2

> 
> Thank you very much.
> 
> Regards.
> 
> Manu
> 
-- 
Thorsten Scherler                       thorsten.at.apache.org
Open Source Java & XML      consulting, training and solutions