Posted to java-user@lucene.apache.org by Thomas Scheffler <th...@uni-jena.de> on 2004/01/27 11:57:26 UTC

umlaut normalisation

Hi,

Is it possible to use umlaut normalisation with Lucene?
For example Query: Hühnerstall --> Query: Huehnerstall.

Of course, this requires that the documents were indexed with normalised umlauts.
This matters a lot, because not everyone who searches
German documents has a German keyboard.

This brings me to the next problem. Currently only Luke returns results
for "Hühnerstall"; my own implementation always turns it into
"huhnerstall" in the query (why?). But there is no "huhnerstall"
in the index.

regards Thomas


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: umlaut normalisation

Posted by Ulrich Mayring <ul...@denic.de>.

prolog_tutor@gmx.de wrote:
> 
> You can write your own Filter that checks for umlauts and substitutes them
> with 'ue' etc. It can be done very easily; just take a look at
> GermanStemFilter and GermanStemmer (method 'substitute') to get an idea.

We're using the Snowball stemmers for German in such a way that we can search
for "ümläut" or "umlaut" and get the same result.
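
To illustrate the effect described here (not how Snowball does it
internally), the following is a minimal sketch that folds umlauts onto
their base vowels using java.text.Normalizer from the JDK. The class and
method names are made up for this example, and "ß" is left untouched
because it has no Unicode decomposition.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene or Snowball.
final class UmlautFolder {

    private UmlautFolder() {}

    // Decomposes each accented letter into base letter + combining mark (NFD),
    // then drops the combining marks, so u-umlaut becomes plain 'u'.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00fcml\u00e4ut" is "umlaut" written with u-umlaut and a-umlaut.
        System.out.println(fold("\u00fcml\u00e4ut")); // prints umlaut
    }
}
```

If the same folding is applied to indexed terms and to query terms, both
spellings end up as the same term.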

Ulrich




Re: umlaut normalisation

Posted by pr...@gmx.de.

Hi Thomas,

> >> For example Query: Hühnerstall --> Query: Huehnerstall.

> I thought it would already be available somehow, since it's supported in
> other major text search engines, for example NSE from IBM. Why not in
> Lucene?

You can write your own Filter that checks for umlauts and substitutes them
with 'ue' etc. It can be done very easily; just take a look at
GermanStemFilter and GermanStemmer (method 'substitute') to get an idea.

Some of the articles available via
http://jakarta.apache.org/lucene/docs/resources.html
might also be interesting.

Best regards,

René




Re: umlaut normalisation

Posted by Markus Spath <ms...@arcor.de>.

Thomas Scheffler wrote:

> So one part of the query uses the WhitespaceAnalyzer and the other the
> GermanAnalyzer. But I don't know why Hühnerstall becomes huhnerstall.

It's because the GermanAnalyzer applies the GermanStemmer (via the
GermanStemFilter), which substitutes umlauts with their non-umlaut counterparts.

From org.apache.lucene.analysis.de.GermanStemmer:

     /**
      * Do some substitutions for the term to reduce overstemming:
      *
      * - Substitute Umlauts with their corresponding vowel: äöü -> aou,
      *   "ß" is substituted by "ss"
      * - Substitute a second char of an pair of equal characters with
      *   an asterisk: ?? -> ?*
      * - Substitute some common character combinations with a token:
      *   sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
      *
      * @return  The term with all needed substitutions.
      */
     private StringBuffer substitute( StringBuffer buffer ) {
     ...

If you want your Analyzer to produce tokens like 'huehnerstall', probably the
easiest option is to start with the GermanAnalyzer and add an UmlautFilter
before the GermanStemFilter is applied.
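
The core of such a filter could look roughly like the sketch below. It
shows only the string substitution, not the Lucene TokenFilter plumbing;
the class name UmlautExpander and its method are invented for this
example. In an actual filter, this substitution would be applied to each
token's text before the token reaches the GermanStemFilter.

```java
// Hypothetical helper, not part of Lucene: expands German umlauts to their
// two-letter transliterations and sharp s to "ss".
final class UmlautExpander {

    private UmlautExpander() {}

    static String expand(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append("ae"); break; // a-umlaut
                case '\u00f6': out.append("oe"); break; // o-umlaut
                case '\u00fc': out.append("ue"); break; // u-umlaut
                case '\u00c4': out.append("Ae"); break; // A-umlaut
                case '\u00d6': out.append("Oe"); break; // O-umlaut
                case '\u00dc': out.append("Ue"); break; // U-umlaut
                case '\u00df': out.append("ss"); break; // sharp s
                default:       out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "H\u00fchnerstall" is "Huhnerstall" written with u-umlaut.
        System.out.println(expand("H\u00fchnerstall")); // prints Huehnerstall
    }
}
```

Because the expansion runs before stemming, the stemmer never sees an
umlaut. The same step has to be used when indexing and when parsing
queries, otherwise the terms will not match.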

markus



Re: umlaut normalisation

Posted by Thomas Scheffler <th...@uni-jena.de>.

Andrzej Bialecki said:
> Thomas Scheffler wrote:
>
>> Hi,
>>
>> Is it possible to use umlaut normalisation with Lucene?
>> For example Query: Hühnerstall --> Query: Huehnerstall.
>>
>> Of course, this requires that the documents were indexed with normalised
>> umlauts.
>> This matters a lot, because not everyone who searches
>> German documents has a German keyboard.
>
> It seems to me the best place would be to put this replacement in a
> custom Analyzer (perhaps extend GermanAnalyzer?).

I thought it would already be available somehow, since it's supported in
other major text search engines, for example NSE from IBM. Why not in
Lucene?

>
>> This brings me to the next problem. Currently only Luke returns results
>> for "Hühnerstall"; my own implementation always turns it into
>> "huhnerstall" in the query (why?). But there is no
>> "huhnerstall"
>> in the index.
>>
>
> Please check which Analyzer you're using in each case.
>

DEBUG Query: MyCoReDemoDC_derivate_0014-->Hühnerstall
DEBUG Set DerivateID to MyCoReDemoDC_derivate_0014 for next query...
DEBUG parsing query using: org.apache.lucene.analysis.de.GermanAnalyzer
DEBUG adding clause: content:huhnerstall
DEBUG preparsed query:(+DerivateID:MyCoReDemoDC_derivate_0014
+content:huhnerstall)

It's the GermanAnalyzer. No matter which analyzer I choose in Luke, it
always finds documents for "Hühnerstall", but I'm not able to find them
with my own code. My extended QueryParser overrides parse. It logs
the Analyzer and then parses the String with super.parse(String). The
resulting Query is put in a BooleanClause and later combined with the
first part (a field query using WhitespaceAnalyzer) that you see above into
a new Query.
So one part of the query uses the WhitespaceAnalyzer and the other the
GermanAnalyzer. But I don't know why Hühnerstall becomes huhnerstall.
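
A minimal sketch of why such a query can miss, assuming the index was
built with an analysis step that keeps umlauts while the query side
replaces them. The two "analyzers" below are simplified stand-ins
written for this example, not Lucene classes, and the names are invented.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical demo, not Lucene code.
final class AnalyzerMismatchDemo {

    private AnalyzerMismatchDemo() {}

    // Stand-in for an analyzer that only lower-cases (assumed index-time behaviour).
    static final UnaryOperator<String> KEEP_UMLAUTS =
            t -> t.toLowerCase(Locale.GERMAN);

    // Stand-in for the substitution seen in the log: umlauts become plain vowels.
    static final UnaryOperator<String> STRIP_UMLAUTS =
            t -> t.toLowerCase(Locale.GERMAN)
                  .replace('\u00e4', 'a')   // a-umlaut
                  .replace('\u00f6', 'o')   // o-umlaut
                  .replace('\u00fc', 'u')   // u-umlaut
                  .replace("\u00df", "ss"); // sharp s

    // Builds a tiny term set from the document and looks up the analysed query term.
    static boolean matches(String doc, String query,
                           UnaryOperator<String> indexAnalyzer,
                           UnaryOperator<String> queryAnalyzer) {
        Set<String> index = new HashSet<>();
        for (String word : doc.split("\\s+")) {
            index.add(indexAnalyzer.apply(word));
        }
        return index.contains(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        String doc = "Der H\u00fchnerstall ist leer";
        String query = "H\u00fchnerstall";
        System.out.println(matches(doc, query, KEEP_UMLAUTS, STRIP_UMLAUTS));  // prints false
        System.out.println(matches(doc, query, STRIP_UMLAUTS, STRIP_UMLAUTS)); // prints true
    }
}
```

Under that assumption, the query term "huhnerstall" is looked up in an
index that only contains the umlaut form, so nothing is found. Using the
same analysis on both sides makes the terms agree again.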


Re: umlaut normalisation

Posted by Andrzej Bialecki <ab...@getopt.org>.

Thomas Scheffler wrote:

> Hi,
> 
> Is it possible to use umlaut normalisation with Lucene?
> For example Query: Hühnerstall --> Query: Huehnerstall.
> 
> Of course, this requires that the documents were indexed with normalised umlauts.
> This matters a lot, because not everyone who searches
> German documents has a German keyboard.

It seems to me the best place would be to put this replacement in a 
custom Analyzer (perhaps extend GermanAnalyzer?).

> This brings me to the next problem. Currently only Luke returns results
> for "Hühnerstall"; my own implementation always turns it into
> "huhnerstall" in the query (why?). But there is no "huhnerstall"
> in the index.
>

Please check which Analyzer you're using in each case.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

