You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by PEP AD Server Administrator <PE...@erl9.siemens.de> on 2004/05/19 17:34:55 UTC

Problem tokenizing UTF-8 with geman umlauts

Hello,
I have HTML-documents which are UTF-8 encoded and contain english and/or
german content. I have written my own Analyser and Filter to replace the
german umlauts with the commonly used pair of character (ü=ue, ä=ae, ö=oe)
to avoid any problems. Still in the HTML-code the german umlauts are shown
as a pair of character representing the UTF-8 encoding (I think). As a
result the StandardTokenizer is missinterpreting the string and splitting a
word with umlaut into 2 tokens which is of no use anymore.
Does anyone ahs experience in this case and can help me back on the road?

Peter MH

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org