Posted to c-dev@xerces.apache.org by Sander van Zoest <sa...@covalent.net> on 2000/09/12 07:00:15 UTC

Mailing List Archive

Hi,

	<http://archive.covalent.net/>

sort of says it all;  I kind of took over Dirk's archives at
<http://xml-archives.webweaving.org> and rewrote them completely in xml/xsl.

Covalent has donated a machine and is willing to supply bandwidth 
and some of my time.

Would you guys a) not get mad at me for snarfing the feather from
modules.apache.org and b) would it be an idea to link this new
site from the various apache sites?

Cheers,

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/

Re: Input encoding and character entities

Posted by Dean Roddey <dr...@charmedquark.com>.

The character references in XML indicate Unicode code points. They don't
represent characters in the source code page; they directly represent
Unicode. So a numeric character reference of X becomes exactly X in the
internal representation, whatever encoding the document declares.
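
To make the distinction concrete, here is a minimal standalone sketch (it is
not Xerces code). The helper names decodeLatin2Byte and parseCharRef are
hypothetical, and the ISO-8859-2 table is deliberately cut down to the entries
needed for the example in this thread. A raw byte goes through the encoding
table; a numeric character reference is read directly as a code point.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper: map one ISO-8859-2 byte to a Unicode code point.
// Only the entries needed here are listed; a real transcoder has a full table.
std::uint32_t decodeLatin2Byte(unsigned char byte)
{
    if (byte < 0xA0) return byte;      // 0x00-0x9F map one-to-one
    switch (byte) {
        case 0xA0: return 0x00A0;      // no-break space
        case 0xF9: return 0x016F;      // small u with ring above
        default:   return 0xFFFD;      // not covered by this sketch
    }
}

// Hypothetical helper: read "&#NNN;" or "&#xHHH;" as a Unicode code point.
// The document encoding plays no part in this step.
std::uint32_t parseCharRef(const std::string& ref)
{
    assert(ref.size() >= 4 && ref.compare(0, 2, "&#") == 0 && ref.back() == ';');
    const bool hex = (ref[2] == 'x');
    const std::string digits =
        ref.substr(hex ? 3 : 2, ref.size() - (hex ? 4 : 3));
    return static_cast<std::uint32_t>(
        std::strtoul(digits.c_str(), nullptr, hex ? 16 : 10));
}
```

With these, decodeLatin2Byte(0xF9) gives 0x16F while parseCharRef("&#249;")
gives 0xF9. To get the u with a ring through a reference, the document would
have to say &#367; or &#x16F; regardless of its declared encoding.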

--------------------------
Dean Roddey
The CIDLib C++ Frameworks
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"It takes two buttocks to make friction"
    - African Proverb


----- Original Message -----
From: "Daniel Schroeder" <da...@mozquito.com>
To: <xe...@xml.apache.org>
Sent: Tuesday, September 12, 2000 10:47 AM
Subject: Input encoding and character entities


>
> I have an application that reads xml files with various input encodings.
> Today, I stumbled across some behaviour I can't explain. Either this is a
> bug in the SAX parser, or I need someone to explain why it is like it
> is...
>
> I run the following (wellformed, but of course not valid) xml file through
> a SAX parser:
>
> <?xml version="1.0" encoding="iso-8859-2"?>
> <html>
> ù&#249;
> </html>
>
> The first character in the line after <html> is - in iso-8859-1 encoding -
> a lowercase 'u' with an accent grave . Its code is 249 = 0xF9. In
> iso-8859-2, this is a lowercase 'u' with a circle above. And as the
> document header specifies encoding=iso-8859-2, the input reader must
> transcode it to an XML internal representation of 0x16F (the "Unicode"
> code of the 'u' with a circle). This works nicely.
>
> However, the character entity &#249; ends up as 0xF9 (= 249, the 'u' with
> an accent grave) in the XML internal representation. This seems wrong to
> me, as it's just a different way of writing a circled 'u' in iso-8859-2,
> right?
>
> I had a look at what's going on, and found that the transcoding of the
> input is done *before* the character entities are resolved. Of course, the
> transcoder knows nothing about such entities. Later, the ScanCharData()
> function converts the char entity to a value, but does not run that value
> through the input transcoder. So it seems character entities are never
> transcoded at all.
>
> Is this a bug, or am I missing something fundamental?
>
> If it's a bug: is the fix as simple as running the value (representing the
> char entity) through the transcoder?
>
> I feel I don't know enough about the inner structures of the SAX parser to
> try and fix this problem myself before asking the people who have worked
> with Xerces for longer than I did.
>
> Cheers
>   Daniel
>
> -- ------------------------------
> Daniel Schröder (daniel@mozquito.com)
> Senior Software Engineer
> Mozquito Technologies AG
>
> "My software never has bugs. It just develops random features."
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org
>

Input encoding and character entities

Posted by Daniel Schroeder <da...@mozquito.com>.

I have an application that reads xml files with various input encodings.
Today, I stumbled across some behaviour I can't explain. Either this is a
bug in the SAX parser, or I need someone to explain why it is like it is...

I run the following (wellformed, but of course not valid) xml file through a
SAX parser:

<?xml version="1.0" encoding="iso-8859-2"?>
<html>
ù&#249;
</html>

The first character in the line after <html> is - in iso-8859-1 encoding - a
lowercase 'u' with an accent grave . Its code is 249 = 0xF9. In iso-8859-2,
this is a lowercase 'u' with a circle above. And as the document header
specifies encoding=iso-8859-2, the input reader must transcode it to an XML
internal representation of 0x16F (the "Unicode" code of the 'u' with a
circle). This works nicely.

However, the character entity &#249; ends up as 0xF9 (= 249, the 'u' with an
accent grave) in the XML internal representation. This seems wrong to me, as
it's just a different way of writing a circled 'u' in iso-8859-2, right?

I had a look at what's going on, and found that the transcoding of the input
is done *before* the character entities are resolved. Of course, the
transcoder knows nothing about such entities. Later, the ScanCharData()
function converts the char entity to a value, but does not run that value
through the input transcoder. So it seems character entities are never
transcoded at all.

Is this a bug, or am I missing something fundamental?

If it's a bug: is the fix as simple as running the value (representing the
char entity) through the transcoder?

I feel I don't know enough about the inner structures of the SAX parser to
try and fix this problem myself before asking the people who have worked
with Xerces for longer than I did.

Cheers
  Daniel

-- ------------------------------
Daniel Schröder (daniel@mozquito.com)
Senior Software Engineer
Mozquito Technologies AG

"My software never has bugs. It just develops random features."

RE: newbie question regarding XMLCh

Posted by Radovan Chytracek <Ra...@cern.ch>.

> Am writing a SAX handler based on the sample SAXPrint.  Can someone tell
> me where to find the details for the XMLCh class?  All I've been able to
> figure out so far is that it is Unicode, but want to know what
> functions/operators are available, especially how to convert from a
> simple char array..
Hi,

     on most of the platforms supported by Xerces-C, XMLCh is not a class but
a typedef to either unsigned short or a wide character type (if supported).
To convert to or from a simple char string, look at the
XMLString::transcode(...) methods. Please be aware that you have to delete
the returned strings yourself to avoid a memory leak.
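
The ownership rule can be sketched without the library. Everything below is a
stand-in written for illustration: the XMLCh typedef is declared locally, and
widenAscii is a hypothetical ASCII-only substitute for
XMLString::transcode(const char*), which in the real library goes through the
local code page transcoder. The point is only that the caller receives a
heap-allocated array and is responsible for freeing it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Local stand-in for the Xerces-C typedef (assumed 16-bit here).
typedef unsigned short XMLCh;

// Hypothetical stand-in for XMLString::transcode(const char*).
// Handles ASCII only; returns a new[] array that the caller must free.
XMLCh* widenAscii(const char* text)
{
    assert(text != nullptr);
    const std::size_t len = std::strlen(text);
    XMLCh* out = new XMLCh[len + 1];
    for (std::size_t i = 0; i < len; ++i)
        out[i] = static_cast<unsigned char>(text[i]);
    out[len] = 0;                       // zero terminator, as with C strings
    return out;
}

// Example caller: unique_ptr<XMLCh[]> runs delete[] on every exit path,
// so the buffer cannot leak even if code is added later that returns early.
std::size_t widenedLength(const char* text)
{
    std::unique_ptr<XMLCh[]> wide(widenAscii(text));
    std::size_t n = 0;
    while (wide[n] != 0) ++n;
    return n;
}
```

A caller that keeps the raw pointer instead has to pair every call with
delete []. Later Xerces-C releases provide a dedicated release function for
transcoded strings, so check the documentation for the version you build
against before choosing how to free them.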


        Rado

newbie question regarding XMLCh

Posted by Jeff Wilson <jw...@home.com>.

Am writing a SAX handler based on the sample SAXPrint.  Can someone tell me where to find the
details for the XMLCh class?  All I've been able to figure out so far is that it is Unicode,
but want to know what functions/operators are available, especially how to convert from a
simple char array..

Thanks.

--
Jeff Wilson
jwilson2000@home.com

Engineering Animation, Inc.
Dimensional Management Group
Southfield, Michigan
Eugene, Oregon
www.eai.com

(541) 684-8590

"If a man could have just half his wishes, he would double his troubles."
  - Benjamin Franklin

"A little nonsense now and then is relished by the wisest men."
  - Willie Wonka

Re: Mailing List Archive

Posted by Craig Noah <Cr...@ca.com>.

I have no problem with your use of the feather for the archive site, and I would
give a +1 (if my vote counts) to linking your archive site to the appropriate
apache sites.  Anyone else?

Craig

Sander van Zoest wrote:

> Hi,
>
>         <http://archive.covalent.net/>
>
> sort of says it all;  I kind of took over Dirk's archives at
> <http://xml-archives.webweaving.org> and rewrote them completely in xml/xsl.
>
> Covalent has donated a machine and is willing to supply bandwidth
> and some of my time.
>
> Would you guys a) not get mad at me for snarfing the feather from
> modules.apache.org and b) would it be an idea to link this new
> site from the various apache sites?
>
> Cheers,
>
> --
> Sander van Zoest                                         [sander@covalent.net]
> Covalent Technologies, Inc.                          http://www.covalent.net/
> (415) 536-5218                                http://www.vanzoest.com/sander/

--
Craig Noah                          INTERNET: Craig.Noah@ca.com
Programmer                                  Computer Associates
1404 Fort Crook Road South         Phone:  (402) 291-8300 x 284
Bellevue,  NE   68005-2969         FAX:    (402) 291-4362