Posted to java-user@lucene.apache.org by Martin Rode <ma...@programmfabrik.de> on 2005/07/28 18:40:14 UTC

European Languages search problem

Hello everybody,

First off, congrats on that great piece of software!


I am working on a Europe-wide project where we have texts in more than 
one European language, namely French, German, and English. I have tried 
both the GermanAnalyzer and the FrenchAnalyzer, and neither does what I 
need.

The GermanAnalyzer should do a classic German umlaut conversion:

ä -> ae
ö -> oe
ü -> ue

Instead it does ä->a, ö->o, ü->u. This is not useful: if a word is 
indexed as "Oeffner" and I search for "Öffner", I don't find it! It does 
get the conversion right for "ß", which becomes "ss".
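
To make the mapping above concrete, here is a minimal sketch in plain
Java. It is not Lucene code and not the GermanAnalyzer's implementation;
the class name UmlautFolding and the method fold are made up for this
illustration, and lowercasing is an added assumption so that "Öffner"
and "Oeffner" end up as the same string.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Lucene): classic German transliteration. */
final class UmlautFolding {
    private UmlautFolding() {}

    /** Lowercases the input, then rewrites the umlauts and sharp s as two letters. */
    static String fold(String text) {
        String lower = text.toLowerCase(Locale.GERMAN);
        StringBuilder out = new StringBuilder(lower.length() + 4);
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                case '\u00e4': out.append("ae"); break; // a-umlaut
                case '\u00f6': out.append("oe"); break; // o-umlaut
                case '\u00fc': out.append("ue"); break; // u-umlaut
                case '\u00df': out.append("ss"); break; // sharp s
                default:       out.append(c);
            }
        }
        return out.toString();
    }
}
```

If the same folding is applied at index time and at query time, the
indexed "Oeffner" and the query "Öffner" both come out as "oeffner".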

For French I tried the FrenchAnalyzer, but it does not work (at least 
not the one in lukeall.jar, which is pretty up-to-date, I guess).

Well, in short, it would be nice to have a simple Analyzer which does 
the great job of the StandardAnalyzer, PLUS a few extras for European 
languages, and that is pretty easy:

For German: see above.
For French: remove the accents (` ´ ^) and the cedilla (the hook below 
the c).
For Swedish, Polish, Czech ...: remove any strokes, slashes, or other 
marks added to the ASCII letters a-z.
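
The accent removal described in this list can be sketched with the JDK's
java.text.Normalizer (available in current JDKs, not a Lucene class).
This is only an illustration under assumptions: the class name
AccentStripper is made up, and the explicit cases for stroked letters
are a small, incomplete sample, added because letters such as the
Danish/Norwegian o with stroke or the Polish l with stroke do not
decompose into a base letter plus a combining mark.

```java
import java.text.Normalizer;

/** Hypothetical helper (not part of Lucene): strips diacritics from Latin text. */
final class AccentStripper {
    private AccentStripper() {}

    static String strip(String text) {
        // NFD splits e.g. e-acute into 'e' followed by a combining acute accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the combining accent, keep the base letter
            }
            switch (c) {
                // Stroked letters have no decomposition, so map them by hand.
                case '\u00f8': out.append('o'); break; // small o with stroke
                case '\u00d8': out.append('O'); break; // capital O with stroke
                case '\u0142': out.append('l'); break; // small l with stroke
                case '\u0141': out.append('L'); break; // capital L with stroke
                default:       out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that this also turns ä into a, so for German it would have to run
after (or be combined with) the ae/oe/ue conversion.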

Before I write my own Analyzer for that, I was wondering whether anybody 
has had the same problem and already found a solution!


Thanks for your help!

Take Care,
Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: European Languages search problem

Posted by Martin Rode <ma...@programmfabrik.de>.
Otis,

Thanks for the quick reply.

The idea to emit multiple tokens is great!

I was also looking for a solution to another problem: I want to present 
a word completion list to the user, so I use reader.terms(new 
Term("start","here")). If I start searching at "henrie", 
reader.terms() should really start at "henriette", which is the next 
term. But in my case it starts at "her", skipping "henriette", which is 
the only term between "henrie" (which is not a term) and "her". This is 
not useful at all! Maybe the developers can at least add an option so 
that reader.terms(new Term("henrie")) yields "henriette".
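
The behaviour asked for here (seek to the first term at or after the
typed prefix, then collect terms while they still share that prefix) can
be illustrated without Lucene using a sorted set. This is a sketch under
assumptions: PrefixCompletion and complete are made-up names, and a
TreeSet merely stands in for the sorted term dictionary. One thing that
may be worth checking in the real code, although it is not confirmed in
this thread, is whether the enumeration returned by reader.terms(Term)
is already positioned on its first term, in which case calling next()
before reading the term would skip "henriette".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical stand-in for a sorted term dictionary with prefix completion. */
final class PrefixCompletion {
    private PrefixCompletion() {}

    /** Returns up to limit terms that start with prefix, in sorted order. */
    static List<String> complete(NavigableSet<String> terms, String prefix, int limit) {
        List<String> hits = new ArrayList<>();
        // tailSet(prefix, true) begins at the first term >= prefix,
        // even when the prefix itself is not a term.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || hits.size() >= limit) {
                break; // left the prefix range, or collected enough
            }
            hits.add(term);
        }
        return hits;
    }

    /** Small sample dictionary matching the example in the text. */
    static NavigableSet<String> sampleTerms() {
        NavigableSet<String> terms = new TreeSet<>();
        terms.add("hello");
        terms.add("henriette");
        terms.add("her");
        return terms;
    }
}
```

With the sample dictionary, completing "henrie" starts at "henriette"
and stops before "her", because "her" no longer shares the prefix.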

Anyway... I need a completion that includes the umlauts, so I can just 
emit two tokens and be happy. That should do it, shouldn't it?

Best Regards,
Martin

P.S.: When I am done with my Analyzer I will contribute it.




Re: European Languages search problem

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Martin,

When you write your own tokenizer/analyzer for this, you'll probably
want to emit multiple tokens for words that have umlauts and such - one
version with ä -> ae, the other with ä -> a perhaps.
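
A minimal sketch of the two-variant idea, in plain Java rather than as a
real Lucene TokenFilter: the class name UmlautVariants and the method
variants are invented for this illustration. In an actual filter the
second variant would presumably be emitted at the same position as the
first (a position increment of 0), which this sketch does not model.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Lucene): both spellings of an umlaut word. */
final class UmlautVariants {
    private UmlautVariants() {}

    /** Returns the "ae"-style form first, then the stripped form if it differs. */
    static Set<String> variants(String word) {
        String lower = word.toLowerCase(Locale.GERMAN);
        StringBuilder expanded = new StringBuilder();
        StringBuilder stripped = new StringBuilder();
        for (char c : lower.toCharArray()) {
            switch (c) {
                case '\u00e4': expanded.append("ae"); stripped.append('a'); break; // a-umlaut
                case '\u00f6': expanded.append("oe"); stripped.append('o'); break; // o-umlaut
                case '\u00fc': expanded.append("ue"); stripped.append('u'); break; // u-umlaut
                case '\u00df': expanded.append("ss"); stripped.append("ss"); break; // sharp s
                default:       expanded.append(c);    stripped.append(c);
            }
        }
        // LinkedHashSet keeps insertion order and collapses identical variants.
        Set<String> out = new LinkedHashSet<>();
        out.add(expanded.toString());
        out.add(stripped.toString());
        return out;
    }
}
```

For a word without umlauts both variants are identical, so only one
token comes back.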

As for stripping accents from characters, somebody posted
ISOLatinFilter.java (I think that was the class name) a few months
back.

If you can contribute your Analyzers, that would be great, as we
already have a small set of Analyzers in Lucene's contrib area in SVN.

Otis

