Posted to java-user@lucene.apache.org by Peter Pimley <pp...@semantico.com> on 2004/12/22 11:52:27 UTC

(Offtopic) The unicode name for a character

Hi everyone,

The Question:
In Java generally, is there an easy way to get the Unicode name of a 
character?  (e.g. "LATIN SMALL LETTER A" from 'a')


The Reasoning (for those who are interested):
The documents I'm indexing have quite a lot of characters that are 
basically variations on the basic A-Z ones.  In my analysis step, I'd 
like to convert these to their closest equivalent in the basic A-Z set.

For some letters, this is easy.  An example is the e-acute character 
(00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the 
single character into two: 'e' and 0301 COMBINING ACUTE ACCENT.  Then I 
can strip all characters that fail Character.isLetterOrDigit.  That 
works fine.

Some characters, however, do not decompose.  An example is the character 
01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with 
'P', but it does not decompose into P + something.

I'm considering taking the Unicode name for each character I encounter 
and matching it against a regular expression like:
^LATIN .* LETTER (.) WITH .*$
... to try to extract the single A-Z|a-z character.
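
For readers on a current JDK, the two steps described above can be sketched
with the standard library alone. This is only an illustration under stated
assumptions: it uses java.text.Normalizer (Java 6+) in place of ICU4J for the
decomposition and Character.getName (Java 7+) for the name lookup, neither of
which is mentioned in the thread. The class name AsciiFolder and the method
fold are hypothetical, and the pattern is narrowed to ([A-Z]) so that only a
single basic letter is ever captured.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Lucene or ICU4J.
final class AsciiFolder {
    // The pattern from the message, narrowed to capture one A-Z letter.
    private static final Pattern NAME =
        Pattern.compile("^LATIN .* LETTER ([A-Z]) WITH .*$");

    static String fold(String input) {
        // Step 1: canonical decomposition, e.g. U+00E9 becomes 'e' + U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128) {                       // plain ASCII passes through
                out.appendCodePoint(cp);
                continue;
            }
            if (!Character.isLetterOrDigit(cp)) { // drops combining marks
                continue;
            }
            // Step 2: no decomposition helped, so guess from the Unicode name.
            String name = Character.getName(cp);  // null if unassigned
            Matcher m = (name == null) ? null : NAME.matcher(name);
            if (m != null && m.matches()) {
                char base = m.group(1).charAt(0);
                // Names are upper case, so restore case from the name itself.
                out.append(name.contains(" SMALL ") ? Character.toLowerCase(base) : base);
            } else {
                out.appendCodePoint(cp);          // no guess: keep the character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("caf\u00e9"));    // e-acute, decomposes
        System.out.println(fold("\u01A4"));       // P with hook, does not decompose
    }
}
```

Characters whose names do not fit the pattern are left untouched, so the
result is a best-effort guess rather than a complete folding table.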


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: (Offtopic) The unicode name for a character

Posted by Chris Hostetter <ho...@fucit.org>.

: However, I don't think that the names are consistent enough to permit a
: generic use of regular expressions. What Daniel is trying to achieve
: looks interesting anyway,

I'm not sure that really matters in the long run ... I think the OP
was asking whether there was a way to get the name in Java because he figured
that way he could programmatically determine what the "base" character was in
his application.  But that doesn't mean he needs to do this
programmatically every time his indexing/searching code sees a character
outside of Latin-1.

It would probably make more sense to write a little one-off program that
could read in this file (UnicodeData.txt) and then spit out all of the
non-Latin-1 characters, each with a guess as to which Latin-1 character could
act as a substitute (if any) based on the name of the character, and a blank
for the user to override.  This program could be run once to generate a nice
small, efficient mapping table that could be (committed to CVS and) reused
over and over.
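
A minimal sketch of such a one-off generator follows. It assumes the
semicolon-separated UnicodeData.txt layout, with the code point in the first
field and the character name in the second. The class name
MappingTableGenerator, the tab-separated output (code point, guess, blank
override column, name) and the name pattern are all illustrative choices, not
something specified in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-off tool: run once, review the output by hand, commit it.
final class MappingTableGenerator {
    private static final Pattern NAME =
        Pattern.compile("^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH .*$");

    // Returns one row per non-Latin-1 character:
    // code point TAB guess TAB (blank override) TAB name
    static List<String> generate(List<String> unicodeDataLines) {
        List<String> rows = new ArrayList<>();
        for (String line : unicodeDataLines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 2 || fields[0].isEmpty()) {
                continue;                          // not a data line
            }
            int codePoint = Integer.parseInt(fields[0], 16);
            if (codePoint <= 0xFF) {
                continue;                          // already inside Latin-1
            }
            String guess = "";                     // empty means "no guess"
            Matcher m = NAME.matcher(fields[1]);
            if (m.matches()) {
                guess = m.group(1).equals("SMALL")
                    ? m.group(2).toLowerCase(Locale.ROOT)
                    : m.group(2);
            }
            rows.add(fields[0] + "\t" + guess + "\t" + "\t" + fields[1]);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MappingTableGenerator.java UnicodeData.txt > mapping.tsv
        for (String row : generate(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(row);
        }
    }
}
```

The empty third column is the place for a human to overrule a wrong guess;
the indexing code would then load the reviewed table instead of consulting
character names at run time.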

-Hoss



Re: (Offtopic) The unicode name for a character

Posted by Pierrick Brihaye <pi...@culture.gouv.fr>.

Hi,

Morus Walter wrote:

> If you cannot find that list somewhere I can mail you a copy.

ICU4J's copy is here:

http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup

See also Unicode's own copy:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

http://pistos.pe.kr/javadocs/etc/icu4j2_4/doc/com/ibm/icu/lang/UCharacter.html#getName(int) 
should also help you.

However, I don't think that the names are consistent enough to permit a 
generic use of regular expressions. What Daniel is trying to achieve 
looks interesting anyway,

Good luck,

-- 
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:pierrick.brihaye@culture.gouv.fr
+33 (0)2 99 29 67 78


Re: (Offtopic) The unicode name for a character

Posted by Morus Walter <mo...@tanto.de>.

Hi Peter,
> 
> The Question:
> In Java generally, Is there an easy way to get the unicode name of a 
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
> 
...
> 
> I'm considering taking the unicode name for each character I encounter 
> and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.
> 
There used to be a list (ASCII) on some ftp server at unicode.org.
I have a version 'UnicodeData.txt' here.
It lists ~ 12000 characters in the form
01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;;;;;N;LATIN CAPITAL LETTER P HOOK;;;01A5;
01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4

If you cannot find that list somewhere I can mail you a copy.
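
As a rough illustration of how one of those lines breaks down, here is a
small parsing sketch. The field positions used (0 = code point, 1 = name,
2 = general category, 13 = simple lowercase mapping) follow the published
UnicodeData.txt layout; the class name UnicodeDataRecord and its members are
made up for this example.

```java
// Hypothetical value class for one UnicodeData.txt line.
final class UnicodeDataRecord {
    final int codePoint;
    final String name;
    final String generalCategory;
    final int simpleLowercase;   // -1 when the field is empty

    private UnicodeDataRecord(int codePoint, String name,
                              String generalCategory, int simpleLowercase) {
        this.codePoint = codePoint;
        this.name = name;
        this.generalCategory = generalCategory;
        this.simpleLowercase = simpleLowercase;
    }

    static UnicodeDataRecord parse(String line) {
        // The -1 limit matters: without it, String.split silently drops
        // trailing empty fields and the array is shorter than 15 entries.
        String[] f = line.split(";", -1);
        if (f.length != 15) {
            throw new IllegalArgumentException("expected 15 fields, got " + f.length);
        }
        int lower = f[13].isEmpty() ? -1 : Integer.parseInt(f[13], 16);
        return new UnicodeDataRecord(Integer.parseInt(f[0], 16), f[1], f[2], lower);
    }

    public static void main(String[] args) {
        UnicodeDataRecord r = parse(
            "01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;;;;;N;"
            + "LATIN CAPITAL LETTER P HOOK;;;01A5;");
        System.out.println(Integer.toHexString(r.codePoint) + " " + r.name
            + " " + r.generalCategory);
    }
}
```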

It would be a nice contribution if you could add your filter to Lucene's
sandbox once it's finished.

Morus


Re: (Offtopic) The unicode name for a character

Posted by Otis Gospodnetic <ot...@yahoo.com>.

If you are not tied to Java, see 'unac' at http://www.senga.org/.
It's old, but if nothing else you could see how it works and rewrite it
in Java.  And if you can, you could donate it to the Lucene Sandbox.

Otis

--- Peter Pimley <pp...@semantico.com> wrote:

> 
> Hi everyone,
> 
> The Question:
> In Java generally, Is there an easy way to get the unicode name of a 
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
> 
> 
> The Reasoning (for those who are interested):
> The documents I'm indexing have quite a lot of characters that are 
> basically variations on the basic A-Z ones.  In my analysis step, I'd
> like to convert these to their closest equivalent in the basic A-Z set.
> 
> For some letters, this is easy.  An example is the e-acute character 
> (00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
> plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the 
> single character into two; 'e' and 0301 COMBINING ACUTE ACCENT.  Then I 
> can strip all characters that fail Character.isLetterOrDigit.  That 
> works fine.
> 
> Some characters however do not decompose.  An example is the character 
> 01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with 
> 'P', but it does not decompose into P + something.
> 
> I'm considering taking the unicode name for each character I encounter 
> and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.
> 

