Posted to java-user@lucene.apache.org by Jonathan O'Connor <jo...@xcom.de> on 2005/03/01 12:18:55 UTC

Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Jon,
I too found some problems with the German analyzer recently. Here's what 
may help:
1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming 
Algorithm for German Words". This paper describes the algorithm 
implemented by GermanAnalyzer.
2. I guess German nouns are all capitalized, so maybe that's why the 
analyzer is case-sensitive. Although that assumes you are indexing 
well-written German and not emails or text messages!
3. The German stemmer converts umlauts into some funny form (the code is a 
bit tricky, and I didn't spend any time looking at it), so maybe that's why 
you can't find umlauts properly. I think the main reason for this umlaut 
change is that many plurals are formed by umlauting, e.g. Haus, Haeuser 
(that ae is an umlaut).
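
To illustrate point 3, here is a minimal sketch in plain Java. It is not the GermanStemmer code. It assumes two things that the thread does not confirm: that umlauts are replaced by their base vowels before terms reach the index, and that a wildcard prefix typed by the user is compared against those terms without being normalized first. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only (not part of Lucene).
final class UmlautFoldingSketch {

    // Assumed normalization: replace each umlaut with its base vowel and
    // sharp s with "ss". Case is left untouched.
    static String fold(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u00e4': out.append('a'); break;   // a-umlaut
                case '\u00f6': out.append('o'); break;   // o-umlaut
                case '\u00fc': out.append('u'); break;   // u-umlaut
                case '\u00c4': out.append('A'); break;   // A-umlaut
                case '\u00d6': out.append('O'); break;   // O-umlaut
                case '\u00dc': out.append('U'); break;   // U-umlaut
                case '\u00df': out.append("ss"); break;  // sharp s
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Return the indexed terms that start with the prefix exactly as given.
    static List<String> prefixMatches(List<String> indexedTerms, String prefix) {
        return indexedTerms.stream()
                .filter(term -> term.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Terms as they would be stored after folding: Haus, Hauser, Ol.
        List<String> indexed = Arrays.asList("Haus", "H\u00e4user", "\u00d6l")
                .stream()
                .map(UmlautFoldingSketch::fold)
                .collect(Collectors.toList());

        // The raw prefix still contains the umlaut, so nothing matches.
        System.out.println(prefixMatches(indexed, "H\u00e4"));

        // Folding the prefix the same way finds both Haus and Hauser.
        System.out.println(prefixMatches(indexed, fold("H\u00e4")));
    }
}
```

If this is what happens in your index, normalizing the prefix yourself before building the wildcard query would be one workaround. Luke will tell you whether the stored terms really look like this.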

Finally, to really understand what's happening, get your hands on Luke. I 
just got it last week, and it's brilliant. It shows you everything about 
your indexes. You can also feed text to an Analyzer, and see what it makes 
of it. This will show you the real reason why your umlaut search is 
failing.
Ciao,
Jonathan O'Connor
XCOM Dublin



From: "Jon Humble" <jo...@tecsphere.com>
Date: 01/03/2005 09:35
Reply-To: "Lucene Users List" <lu...@jakarta.apache.org>
To: <lu...@jakarta.apache.org>
Subject: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Hello,
 
We're using the GermanAnalyzer/Stemmer to index/search our (German)
website.
I have a few questions:
 
(1)     Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2)     With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So, a* is fine but anä* or ö*
always find zero results. 
(3)     In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.
 
I will be grateful for any light that can be shed on these problems.
 
With Thanks,
 
Jon.
 
Jon Humble
BSc (hons,)
Software Engineer
eMail: jon.humble@tecsphere.com

TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
 
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com
 
 




*** Current XCOM AG Events ***

XCOM invites you to the IBM Workplace Roadshow in Berlin (02.03.2005)
Registration and information at http://lotus.xcom.de/events

Workshop series "Mobilisierung von Lotus Notes Applikationen" in Berlin (05.03.2005)
Registration and information at http://lotus.xcom.de/events


*** XCOM AG Legal Disclaimer ***

Diese E-Mail einschliesslich ihrer Anhaenge ist vertraulich und ist allein für den Gebrauch durch den vorgesehenen Empfaenger bestimmt. Dritten ist das Lesen, Verteilen oder Weiterleiten dieser E-Mail untersagt. Wir bitten, eine fehlgeleitete E-Mail unverzueglich vollstaendig zu loeschen und uns eine Nachricht zukommen zu lassen.

This email may contain material that is confidential and for the sole use of the intended recipient. Any review, distribution by others or forwarding without express permission is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Posted by Jonathan O'Connor <jo...@xcom.de>.
Apologies Erik,
This must be one of those apostrophe-in-email-address problems I always 
get. Recently I removed the apostrophe from the email address I give out.
Our server recognizes both email addresses, but some of these mailing lists 
don't like the O'Connor clann!
Ciao,
Jonathan O'Connor
XCOM Dublin



From: Erik Hatcher <er...@ehatchersolutions.com>
Date: 01/03/2005 12:16
Reply-To: "Lucene Users List" <lu...@jakarta.apache.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Subject: Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

I had to moderate both Jonathan's and Jon's messages into the list. 
Please subscribe to the list and post to it from the address you 
subscribed with.  I cannot always guarantee I'll catch moderation messages 
and send them through in a timely fashion.

                 Erik



Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I had to moderate both Jonathan's and Jon's messages into the list.  
Please subscribe to the list and post to it from the address you 
subscribed with.  I cannot always guarantee I'll catch moderation messages 
and send them through in a timely fashion.

	Erik

