Posted to j-dev@xerces.apache.org by Milind Gadre <mi...@ecplatforms.com> on 2001/01/09 00:13:23 UTC

Parsing Latin-1 entities

Hello all,

I have an XML file with entities representing iso-8859-1 characters:

    <some-tag> &Auml; </some-tag>

I would like for Xerces to parse the fragment above, and return the
*Unicode* character corresponding to &Auml; (a-umlaut). This does not
seem to be possible with Node.getNodeValue, which simply seems to return
an empty string.

In my DTD, I have an ENTITY definition such as

    <!ENTITY Auml "&#38;#196;">

Any idea what I am doing wrong, or if what I want to do is possible at
all?

Final question: is there a 'public' DTD that lists all these pesky
entities?

Thanks in advance.

Regards...

Milind Gadre
ecPlatforms, Inc
901 Mariner's Island Blvd, Suite 565
San Mateo, CA 94404
C: 510-919-0596
F: 815-352-0779
milind@ecplatforms.com

Re: Parsing Latin-1 entities

Posted by Mikael Ståldal <mi...@ingen.reklam.staldal.nu>.

In article <3A...@apache.org>, Andy Clark <an...@apache.org> wrote:

>&Auml; is an unknown entity for XML. You have to have that
>entity declared in your grammar for it to work correctly
>(which I see that you do). In validating mode, the parser 
>will report this error to your registered error handler.

It actually reports this error even in non-validating mode.

BTW, is it possible to make Xerces (in non-validating mode) parse a
file with undeclared entity references without errors, and report these
entity references with the SAX ContentHandler.skippedEntity() event?

-- 
/****************************************************************\
* You have just read a message from Mikael Ståldal.              *
*                                                                *
* Remove "ingen.reklam." from the address before mail replying.  *
\****************************************************************/

Re: Parsing Latin-1 entities

Posted by Milind Gadre <mi...@ecplatforms.com>.

Andy, thanks again.

> Perhaps you are calling getNodeValue() from the wrong node
> in the document. Make sure that you do call it on the text
> node children of the "some-tag" element.

I am looking at every node under the some-tag element. However, the
Unicode character corresponding to a-umlaut does not get returned.

Re: Parsing Latin-1 entities

Posted by Andy Clark <an...@apache.org>.

Milind Gadre wrote:
>     <some-tag> &Auml; </some-tag>
> 
> I would like for Xerces to parse the fragment above, and return the
> *Unicode* character corresponding to &Auml; (a-umlaut). This does not
> seem to be possible with Node.getNodeValue, which simply seems to return
> an empty string.

&Auml; is an unknown entity for XML. You have to have that
entity declared in your grammar for it to work correctly
(which I see that you do). In validating mode, the parser 
will report this error to your registered error handler.

> In my DTD, I have an ENTITY definition such as
> 
>     <!ENTITY Auml "&#38;#196;">
> 
> Any idea what I am doing wrong, or if what I want to do is possible at
> all?

Perhaps you are calling getNodeValue() from the wrong node
in the document. Make sure that you do call it on the text
node children of the "some-tag" element.
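
To illustrate the point about text node children, here is a minimal
sketch using the standard JAXP and DOM interfaces. It is an assumption
of this example, not something confirmed in the thread, that the parser
may be configured to keep EntityReference nodes in the tree. In that
case the expanded text sits one level further down, and getNodeValue()
on the EntityReference node itself returns null. The class and method
names (EntityTextDemo, collectText, parseAndCollect) are hypothetical,
and the DTD is inlined as an internal subset only to keep the example
self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EntityTextDemo {

    // Appends the character data found beneath 'node' to 'out'.
    // Text and CDATA children are read with getNodeValue(); EntityReference
    // children are descended into, because their own node value is null and
    // the replacement text lives in their child nodes.
    static void collectText(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                out.append(child.getNodeValue());
            } else if (type == Node.ENTITY_REFERENCE_NODE) {
                collectText(child, out);
            }
        }
    }

    // Parses the XML string and returns the text under the root element.
    static String parseAndCollect(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same entity declaration as in the original question, placed in
        // an internal DTD subset for this example.
        String xml = "<!DOCTYPE some-tag [<!ENTITY Auml \"&#38;#196;\">]>"
                + "<some-tag> &Auml; </some-tag>";
        String text = parseAndCollect(xml);
        // Print the code point of the first non-blank character.
        System.out.println((int) text.trim().charAt(0));
    }
}
```

If the parser expands entity references while building the tree, the
loop simply sees ordinary text nodes, so the same code should cover
both configurations.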

> Final question: is there a 'public' DTD that lists all these pesky
> entities?

Check out XHTML at W3C. I think they include all of the HTML
entities in a separate DTD entity for general use.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org