You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Ian Parkin <ia...@hotmail.com> on 2002/09/18 22:55:57 UTC

How to handle umlauts ?

Hello all,

I suspect my answer will involve unicode, but I'd like to make sure that I 
am going down the right path here.

I have 100,000+ small HTML files that are mainly in the english language. I 
just noticed that we have some user names with umlauts. These are seemingly 
stored and searchable as the '?' character.

My code is based on the demo code that is provided with Lucene, under the 
'demo' directory.

I am wondering what changes I will need to make to handle such characters as 
umlauts within english text ?

Thanks

IAP

_________________________________________________________________
Join the worlds largest e-mail service with MSN Hotmail. 
http://www.hotmail.com


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: How to handle umlauts ?

Posted by Rémy Amouroux <re...@kelkoo.com>.

Le mer 18/09/2002 à 22:55, Ian Parkin a écrit : 
> Hello all,
> 
> I suspect my answer will involve unicode, but I'd like to make sure that I 
> am going down the right path here.
> 
> I have 100,000+ small HTML files that are mainly in the english language. I 
> just noticed that we have some user names with umlauts. These are seemingly 
> stored and searchable as the '?' character.
> 
> My code is based on the demo code that is provided with Lucene, under the 
> 'demo' directory.
> 
> I am wondering what changes I will need to make to handle such characters as 
> umlauts within english text ?

Try to use an analyzer like StandardAnalyzer in both direction: while
indexing and while searching.
I don't remember if one of the filter used by StandardAnalyzer modifies
the accented letters, if not, you will have to create one.

The idea is to transform every word to a normalized form (removing
common words, removing accents [ü => u], making the word lowercase)
before indexing the word and before searching the word. That way,
someone looking for ümlaut will have the same results than someone
looking for umlaut. (and knowing the lazyness of most of common users,
they will thank oyu to make that posisble :-)

It's quite easy to implement a subclass of
org.apache.lucene.analysis.TokenFilter that will answer to your needs
and to use it in a subclass of org.apache.lucene.analysis.Analyzer
with all the supplementary filters you need to add.

Remy

-- 
E-mail : Remy.Amouroux@kelkoo.com
Kelkoo R&D Director (http://www.kelkoo.com/)


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>