Posted to solr-user@lucene.apache.org by radarghost <ra...@yahoo.com> on 2009/02/18 15:28:43 UTC

foreign characters equivalent in solr search

We are using Solr 1.2 and don't want to upgrade to 1.3 until there is an
official release for Debian.
I want Solr to match the unaccented equivalent of a foreign character so
that we get better results.

For example:

If a user searches for Tiesto, which is indexed as Tiësto in our Solr, we
want Solr to return that result as well.
More generally, a search with a plain a should also return words that
contain á, à, â, ä, ã, å, and likewise e for ë, i for ï, o for ö, and so on.

Is there any solution?

I hope I have explained what I need despite my poor English.

Thanks



Re: foreign characters equivalent in solr search

Posted by Chris Hostetter <ho...@fucit.org>.

: If a user searches for Tiesto, which is indexed as Tiësto in our Solr, we
: want Solr to return that result as well.

This is what the ISOLatin1AccentFilter is for.  It's been included in Solr 
since 1.1.

It's been deprecated in favor of the newer ASCIIFoldingFilter, which does 
a better job with other charsets, but all of your examples seem to be 
Latin-1 characters, so I'm guessing it will work pretty well in your 
case.



-Hoss
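
As a rough illustration of what accent folding does to a term, here is a minimal standalone Java sketch. It is not the source of ISOLatin1AccentFilter or ASCIIFoldingFilter; it swaps in java.text.Normalizer from the JDK, and the class name AccentFoldSketch and the method fold are made up for this example. Characters without a canonical decomposition (for example ø or ß) pass through unchanged here, so it may cover less than the real filters.

```java
import java.text.Normalizer;

// Standalone sketch: NOT Solr's ISOLatin1AccentFilter, only an illustration
// of accent folding using the JDK's Unicode normalizer.
class AccentFoldSketch {

    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks so only the base letters remain.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Ti\u00EBsto" is Tiesto spelled with an e-diaeresis.
        System.out.println(fold("Ti\u00EBsto"));
        // A plain ASCII query term is left untouched.
        System.out.println(fold("Tiesto"));
    }
}
```

In Solr the same folding has to be applied in both the index-time and the query-time analyzer, so that Tiesto and Tiësto reduce to the same term.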

Re: foreign characters equivalent in solr search

Posted by AHMET ARSLAN <io...@yahoo.com>.

> We will try that and post the results here, but it seems we
> may run into problems with the highlighting function.

No, highlighting works fine with that. I am also using a similar filter for Turkish characters: I replace ç with c, ş with s, and so on at index time.

Another (easier but less efficient) way to implement this filter is to extend org.apache.lucene.index.memory.SynonymMap and override the public String[] getSynonyms(String word) method. In this case your getSynonyms method returns either new String[0] (nothing to fold) or a one-element array holding the folded word. The constructor can call super(null); without problems.

After that you can pass your custom SynonymMap to the constructor of Lucene's SynonymTokenFilter, without modifying SynonymTokenFilter itself:

stream = new SynonymTokenFilter(stream, new MySynonymMap(), Integer.MAX_VALUE);

This works because SynonymTokenFilter only calls the getSynonyms method of SynonymMap.
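
The getSynonyms contract described above could look roughly like the following. This is a standalone sketch that deliberately does not extend Lucene's SynonymMap, so it compiles without the Lucene jars; the class name FoldingSynonymsSketch and the small mapping table are assumptions for illustration and cover only characters mentioned in this thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the getSynonyms logic. It does NOT extend Lucene's
// SynonymMap; in a real subclass this body would go into the override.
class FoldingSynonymsSketch {

    // Hypothetical mapping table: accented character -> ASCII replacement.
    // The escapes are the a-variants, e/i/o with diaeresis, c-cedilla and
    // s-cedilla, in the same order as the replacement string below.
    private static final Map<Character, Character> FOLD = new HashMap<>();
    static {
        String from = "\u00E1\u00E0\u00E2\u00E4\u00E3\u00E5\u00EB\u00EF\u00F6\u00E7\u015F";
        String to = "aaaaaaeiocs";
        for (int i = 0; i < from.length(); i++) {
            FOLD.put(from.charAt(i), to.charAt(i));
        }
    }

    // Returns an empty array when the word needs no folding, otherwise a
    // one-element array holding the folded word.
    static String[] getSynonyms(String word) {
        StringBuilder folded = new StringBuilder(word.length());
        boolean changed = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Character replacement = FOLD.get(c);
            if (replacement != null) {
                folded.append(replacement.charValue());
                changed = true;
            } else {
                folded.append(c);
            }
        }
        return changed ? new String[] { folded.toString() } : new String[0];
    }

    public static void main(String[] args) {
        // Accented input yields one synonym; plain input yields none.
        System.out.println(Arrays.toString(getSynonyms("Ti\u00EBsto")));
        System.out.println(Arrays.toString(getSynonyms("Tiesto")));
    }
}
```

With this shape, the original accented token stays in the index and the folded form is injected next to it as a synonym, which is why a query for the plain spelling can match.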

Re: foreign characters equivalent in solr search

Posted by radarghost <ra...@yahoo.com>.

Thanks.

We will try that and post the results here, but it seems we may run into
problems with the highlighting function.



Ahmet Arslan wrote:
> 
> I think the best way to do this is to modify
> org.apache.lucene.index.memory.SynonymTokenFilter and apply this filter
> at index time.
> 
> If token.termBuffer() contains one of those characters (á, à, â, ä, ã, å),
> replace it with its equivalent ASCII character (a), then inject the new
> Token as a synonym.
> 
> I don't know whether it is the best way, but it will give you what you want.
> 

Re: foreign characters equivalent in solr search

Posted by radarghost <ra...@yahoo.com>.

It may take too long until Solr 1.4 is released.

Is there any other solution for Solr 1.2?

Anyway, thanks for the reply.


Koji Sekiguchi-2 wrote:
> 
> CharFilter will solve the problem, but it comes with Solr 1.4.
> 
> https://issues.apache.org/jira/browse/SOLR-822
> 
> Koji
> 

Re: foreign characters equivalent in solr search

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

CharFilter will solve the problem, but it comes with Solr 1.4.

https://issues.apache.org/jira/browse/SOLR-822

Koji

AHMET ARSLAN wrote:
> I think the best way to do this is to modify org.apache.lucene.index.memory.SynonymTokenFilter and apply this filter at index time.
>
> If token.termBuffer() contains one of those characters (á, à, â, ä, ã, å), replace it with its equivalent ASCII character (a), then inject the new Token as a synonym.
>
> I don't know whether it is the best way, but it will give you what you want.
>

Re: foreign characters equivalent in solr search

Posted by AHMET ARSLAN <io...@yahoo.com>.

I think the best way to do this is to modify org.apache.lucene.index.memory.SynonymTokenFilter and apply this filter at index time.

If token.termBuffer() contains one of those characters (á, à, â, ä, ã, å), replace it with its equivalent ASCII character (a), then inject the new Token as a synonym.

I don't know whether it is the best way, but it will give you what you want.
