Posted to java-user@lucene.apache.org by Sascha Fahl <sa...@evenity.net> on 2008/11/18 13:06:43 UTC

Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Hi,
what is the best way to transform the German characters ö, ä, ü, ß into
oe, ae, ue, ss during analysis?

Thanks,


Sascha Fahl
Software Development

evenity GmbH
Zu den Mühlen 19
D-35390 Gießen

Mail: sascha@evenity.net

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by csantos <cl...@gmail.com>.
Hi,

I'm a newbie with Lucene and I started some testing with Lucene 2.2.0.
I developed my own Analyser and my own Filter based on examples found here,
that is:

public class DiacriticAnalyser extends GermanAnalyzer {
....

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = super.tokenStream(fieldName, reader);
    result = new ISOLatin1DiacriticFilter(result);
    return result;
  }

}

public class ISOLatin1DiacriticFilter extends TokenFilter {
...

  @Override
  public final Token next() throws java.io.IOException {
    final Token t = input.next();
    if (t != null)
      t.setTermText(removeDiacritics(t.termText()));
    return t;
  }

}

What I don't understand is: isn't the call to input.next() endlessly
recursive? I mean, the TokenStream class is abstract and the TokenFilter
class doesn't implement next(). And who calls next()? I only call the
constructor of the ISOLatin1DiacriticFilter class.
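
On the recursion question: in a TokenFilter, `input` is the stream that was handed to the filter's constructor. It is a different object from the filter itself, so `input.next()` calls the wrapped stream's next() and not the filter's own. The constructor only builds the chain; next() is called later by whatever consumes the stream (during indexing, that is Lucene's indexing code). The sketch below illustrates this delegation with hypothetical stand-in classes (MiniStream, MiniTokenizer, MiniFilter, SharpSFilter, FilterChainDemo). They are assumptions made for illustration and are not Lucene's real classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an abstract token stream; not Lucene's TokenStream.
abstract class MiniStream {
    /** Returns the next token, or null when the stream is exhausted. */
    abstract String next();
}

// Plays the role of a tokenizer: the concrete source at the bottom of the chain.
class MiniTokenizer extends MiniStream {
    private final Iterator<String> it;

    MiniTokenizer(String text) {
        this.it = Arrays.asList(text.split("\\s+")).iterator();
    }

    @Override
    String next() {
        return it.hasNext() ? it.next() : null;
    }
}

// Plays the role of a filter base class: it only stores the wrapped stream.
abstract class MiniFilter extends MiniStream {
    protected final MiniStream input; // the wrapped stream, never `this`

    MiniFilter(MiniStream input) {
        this.input = input;
    }
}

// A concrete filter: pulls one token from the wrapped stream and rewrites it.
class SharpSFilter extends MiniFilter {
    SharpSFilter(MiniStream input) {
        super(input);
    }

    @Override
    String next() {
        String t = input.next(); // delegates downwards, so there is no recursion
        return t == null ? null : t.replace("\u00DF", "ss");
    }
}

class FilterChainDemo {
    /** Plays the role of the consumer: this is the code that calls next(). */
    static List<String> drain(MiniStream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Constructing the chain does not read anything yet.
        MiniStream chain = new SharpSFilter(new MiniTokenizer("Gie\u00DFen Stra\u00DFe"));
        System.out.println(drain(chain)); // prints [Giessen, Strasse]
    }
}
```

Each call to next() on the outermost filter travels one level down the chain until it reaches the tokenizer, which is the only class that produces tokens by itself.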

regards,

Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
 > > Where do I get the CharFilter library? I'm using Lucene, not Solr.
 > >
 > > Thanks,
 > > Sascha
 > CharFilter is included in recent Solr nightly builds.
 > It is not an out-of-the-box solution for Lucene yet, sorry.
 > If I have time, I will port it to Lucene this weekend.

The patch is now available for Lucene at:
https://issues.apache.org/jira/browse/LUCENE-1466

Koji




Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Sascha Fahl wrote:
> Where do I get the CharFilter library? I'm using Lucene, not Solr.
>
> Thanks,
> Sascha
CharFilter is included in recent Solr nightly builds.
It is not an out-of-the-box solution for Lucene yet, sorry.
If I have time, I will port it to Lucene this weekend.

Koji





Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Sascha Fahl <sa...@evenity.net>.
Where do I get the CharFilter library? I'm using Lucene, not Solr.

Thanks,
Sascha

On 18.11.2008 at 14:11, Koji Sekiguchi wrote:

> Uwe Goetzke wrote:
> > Use ISOLatin1AccentFilter, although it is not perfect...
> > So I made my own ISOLatin2AccentFilter and changed this method.
>
> Or use the CharFilter library. It is only available for Solr as of now, though.
>
> See:
> https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
> https://issues.apache.org/jira/browse/SOLR-822
>
> Koji



Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Uwe Goetzke wrote:
 > Use ISOLatin1AccentFilter, although it is not perfect...
 > So I made my own ISOLatin2AccentFilter and changed this method.

Or use the CharFilter library. It is only available for Solr as of now, though.

See:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
https://issues.apache.org/jira/browse/SOLR-822
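
The idea behind character-level normalization is to rewrite characters in the Reader before the tokenizer ever sees them. Here is a minimal sketch of that idea using only java.io, assuming a hypothetical class name UmlautExpandingReader. It is not the CharFilter API from SOLR-822, and it makes no attempt to keep track of offsets into the original text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class for illustration; not part of Lucene or Solr.
class UmlautExpandingReader extends Reader {
    private final Reader in;
    private String pending = ""; // replacement text not yet fully handed out
    private int pendingPos = 0;

    UmlautExpandingReader(Reader in) {
        this.in = in;
    }

    /** Returns the replacement for c, or null if c passes through unchanged. */
    private static String expand(char c) {
        switch (c) {
            case '\u00E4': return "ae"; // a-umlaut
            case '\u00F6': return "oe"; // o-umlaut
            case '\u00FC': return "ue"; // u-umlaut
            case '\u00C4': return "AE"; // capital A-umlaut
            case '\u00D6': return "OE"; // capital O-umlaut
            case '\u00DC': return "UE"; // capital U-umlaut
            case '\u00DF': return "ss"; // sharp s
            default: return null;
        }
    }

    @Override
    public int read() throws IOException {
        // First hand out what is left of an earlier replacement.
        if (pendingPos < pending.length()) {
            return pending.charAt(pendingPos++);
        }
        int c = in.read();
        if (c < 0) return -1;
        String replacement = expand((char) c);
        if (replacement == null) return c;
        pending = replacement;
        pendingPos = 1;
        return replacement.charAt(0);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c < 0) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Helper for the demo: reads everything using a deliberately tiny buffer. */
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[3];
        try {
            for (int n = r.read(buf, 0, buf.length); n != -1; n = r.read(buf, 0, buf.length)) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader r = new UmlautExpandingReader(new StringReader("Gie\u00DFen M\u00FChlen"));
        System.out.println(readAll(r)); // prints Giessen Muehlen
    }
}
```

Such a reader could be wrapped around the Reader that is passed to an analyzer, so that any tokenizer downstream only sees the expanded text.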

Koji




RE: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Teruhiko Kurosaka <Ku...@basistech.com>.
Naming this class with "Latin2" may be misleading.
Latin2 refers to the ISO-8859-2 character set.

http://en.wikipedia.org/wiki/ISO_8859-2


> From: Uwe Goetzke [mailto:uwe.goetzke@healy-hudson.com] 
> Sent: Tuesday, November 18, 2008 7:26 AM
> To: java-user@lucene.apache.org
> Cc: sascha@evenity.net
> Subject: AW: Transforming german umlaute like ö,ä,ü,ß into 
> oe, ae, ue, ss
> 
> Use ISOLatin1AccentFilter, although it is not perfect...
> So I made my own ISOLatin2AccentFilter and changed this method.




AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

Posted by Uwe Goetzke <uw...@healy-hudson.com>.
Use ISOLatin1AccentFilter, although it is not perfect...
So I made my own ISOLatin2AccentFilter and changed this method.
We use our own analysers, so you would use something like this:

		org.apache.lucene.analysis.TokenStream result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
		result = new ISOLatin2AccentFilter(result);
		result = new org.apache.lucene.analysis.LowerCaseFilter(result);


	/**
	 * Replaces accented characters in a String with unaccented equivalents.
	 */
	public final static String removeAccents(String input) {
		final StringBuffer output = new StringBuffer();
		for (int i = 0; i < input.length(); i++) {
			switch (input.charAt(i)) {
				case '\u00C0' : // À
				case '\u00C1' : // Á
				case '\u00C2' : // Â
				case '\u00C3' : // Ã
				case '\u00C5' : // Å
					output.append("A");
					break;
				case '\u00C4' : // Ä
				case '\u00C6' : // Æ
					output.append("AE");
					break;
				case '\u00C7' : // Ç
					output.append("C");
					break;
				case '\u00C8' : // È
				case '\u00C9' : // É
				case '\u00CA' : // Ê
				case '\u00CB' : // Ë
					output.append("E");
					break;
				case '\u00CC' : // Ì
				case '\u00CD' : // Í
				case '\u00CE' : // Î
				case '\u00CF' : // Ï
					output.append("I");
					break;
				case '\u00D0' : // Ð
					output.append("D");
					break;
				case '\u00D1' : // Ñ
					output.append("N");
					break;
				case '\u00D2' : // Ò
				case '\u00D3' : // Ó
				case '\u00D4' : // Ô
				case '\u00D5' : // Õ
				case '\u00D8' : // Ø
					output.append("O");
					break;
				case '\u00D6' : // Ö
				case '\u0152' : // Œ
					output.append("OE");
					break;
				case '\u00DE' : // Þ
					output.append("TH");
					break;
				case '\u00D9' : // Ù
				case '\u00DA' : // Ú
				case '\u00DB' : // Û
					output.append("U");
					break;
				case '\u00DC' : // Ü
					output.append("UE");
					break;
				case '\u00DD' : // Ý
				case '\u0178' : // Ÿ
					output.append("Y");
					break;
				case '\u00E0' : // à
				case '\u00E1' : // á
				case '\u00E2' : // â
				case '\u00E3' : // ã
				case '\u00E5' : // å
					output.append("a");
					break;
				case '\u00E4' : // ä
				case '\u00E6' : // æ
					output.append("ae");
					break;
				case '\u00E7' : // ç
					output.append("c");
					break;
				case '\u00E8' : // è
				case '\u00E9' : // é
				case '\u00EA' : // ê
				case '\u00EB' : // ë
					output.append("e");
					break;
				case '\u00EC' : // ì
				case '\u00ED' : // í
				case '\u00EE' : // î
				case '\u00EF' : // ï
					output.append("i");
					break;
				case '\u00F0' : // ð
					output.append("d");
					break;
				case '\u00F1' : // ñ
					output.append("n");
					break;
				case '\u00F2' : // ò
				case '\u00F3' : // ó
				case '\u00F4' : // ô
				case '\u00F5' : // õ
				case '\u00F8' : // ø
					output.append("o");
					break;
				case '\u00F6' : // ö
				case '\u0153' : // œ
					output.append("oe");
					break;
				case '\u00DF' : // ß
					output.append("ss");
					break;
				case '\u00FE' : // þ
					output.append("th");
					break;
				case '\u00F9' : // ù
				case '\u00FA' : // ú
				case '\u00FB' : // û
					output.append("u");
					break;
				case '\u00FC' : // ü
					output.append("ue");
					break;
				case '\u00FD' : // ý
				case '\u00FF' : // ÿ
					output.append("y");
					break;
				default :
					output.append(input.charAt(i));
					break;
			}
		}
		return output.toString();
	}
}
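
For comparison, a generic way to strip accents is Unicode decomposition with java.text.Normalizer (available since Java 6). It turns ü into u rather than ue and leaves ß untouched, which is why an explicit mapping like the one above is needed for German transliteration. This is a minimal sketch, assuming the hypothetical names UmlautVsNormalizerDemo, stripMarks and germanFold; germanFold covers only the characters from the original question.

```java
import java.text.Normalizer;

// Hypothetical demo class; the method names are made up for illustration.
class UmlautVsNormalizerDemo {

    /** Generic accent stripping: decompose to NFD, then drop combining marks. */
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    /** German-style transliteration for the characters asked about in this thread. */
    static String germanFold(String s) {
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00E4': out.append("ae"); break; // a-umlaut
                case '\u00F6': out.append("oe"); break; // o-umlaut
                case '\u00FC': out.append("ue"); break; // u-umlaut
                case '\u00C4': out.append("AE"); break; // capital A-umlaut
                case '\u00D6': out.append("OE"); break; // capital O-umlaut
                case '\u00DC': out.append("UE"); break; // capital U-umlaut
                case '\u00DF': out.append("ss"); break; // sharp s
                default: out.append(c); break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "M\u00FChlen Gie\u00DFen";
        System.out.println(stripMarks(text)); // the umlaut loses its dots, the sharp s stays
        System.out.println(germanFold(text)); // prints Muehlen Giessen
    }
}
```

Whichever mapping is chosen, it has to be applied both at index time and at query time, otherwise the terms will not match.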

Regards

Uwe Goetzke
Head of Product Development
Healy Hudson GmbH 
Procurement & Retail Solutions   



