Posted to c-dev@xerces.apache.org by "Heinz, Chris" <ch...@hp.com> on 2009/05/20 16:54:13 UTC

problem with special characters / entities

Hey, I'm a noob here, so if anyone wants to point me to the archives of this mailing list to search for my problem, that's fine.

My problem is that I have three special characters being placed into formatted text:  carriage returns, non-breaking spaces, and soft hyphens.  I can input them as &#x0D;, &#xA0;, and &#xAD;.  Xerces handles the first two fine, but for the third I seem to be getting a standard hyphen.  When I write them out, though, they go in as the raw non-printing characters rather than as character references.  Xerces can import those fine, so I can round-trip, but the non-printing characters aren't too user-friendly.

I have defined in my dtd file:

<!ENTITY return "&#x0D;">
<!ENTITY nbsp "&#xA0;">
<!ENTITY softhyphen "&#xAD;">

I also tried &return;, etc., but that didn't seem to work at all.

I've checked DomOptions and looked at DOMSerializer, but haven't seen anything that looks like it would help.

Any ideas?

Thanks,
Chris

RE: problem with special characters / entities

Posted by "Heinz, Chris" <ch...@hp.com>.

Dave,

Thanks for the response.  My detailed responses to your questions are below.  In general, I think I see that Xerces (rightly) won't do what I'm trying to do.  So, no worries.

Thanks again, 
Chris

-----Original Message-----
From: David Bertoni [mailto:dbertoni@apache.org] 
Sent: Wednesday, May 20, 2009 1:34 PM
To: c-dev@xerces.apache.org
Subject: Re: problem with special characters / entities

Heinz, Chris wrote:
> Hey, I'm a noob here, so if anyone wants to point me to the archives of 
> this mailing list to search for my problem, that's fine.
> 
> My problem is that I have three special characters being placed into 
> formatted text:  return, non-breaking spaces, and soft hyphens.  I can 
> input them as &#x0D;, &#xA0;, and &#xAD;.  The first two Xerces handles
> fine, the third I seem to be getting a standard hyphen???
Have you examined the content of the document to verify this?  I don't 
know of any code in Xerces-C that would translate a soft hyphen to a 
regular hyphen.
>>> I think this was on my application end.  My duh.

> But when I 
> write them out, they go in as non-printing control characters.  Xerces 
> can import those fine, so I can round trip, but, the non-printing 
> characters aren't too user-friendly.
I'm not sure I understand your question and the problems you're seeing. 
Are you trying to configure the serializer so it generates entities
for certain characters?  If so, there's no way to do that.
>>> Yeah, I guess that's what I'm trying to do.  You're right, these are legal Windows-1252 characters, so why should Xerces do anything to them?

> 
> I have defined in my dtd file:
> 
> <!ENTITY return "&#x0D;">
> <!ENTITY nbsp "&#xA0;">
> <!ENTITY softhyphen "&#xAD;">
In general, the DTD is processed by the parser, the entities are 
expanded, and their identities are lost. There is no connection between 
the DTD in the input document, and the document the serializer generates.

> 
> And tried &return;, etc, that didn't seem to work at all.
Didn't seem to work in what way?
>>> I put &return; into the input stream, and it seemed to be completely ignored: I got no character at that position.  Not a biggie, since &#x0d; works fine.

> I've checked DomOptions and looked at DOMSerializer, haven't seen 
> anything that looks like it would help.
The usual way to handle this is to specify US-ASCII as the encoding. 
Since that encoding only supports characters below 128, all other 
characters will be written as numeric character references.

However, that will not solve the problem with the U+000D, which should 
already be written as a numeric character reference.  If that's not the 
case, the Xerces serializer has a bug.
>>> I've attached the output file (created via a MemBufFormatTarget rather than written directly to a file).  There are single x0Ds at the end of each of the <fo:inline>s in the 2nd <fo:block>.  Note that Xerces did not create the overall XML of this file, just the document embedded in the [CDATA[

Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org

Re: problem with special characters / entities

Posted by David Bertoni <db...@apache.org>.

Heinz, Chris wrote:
> Hey, I’m a noob here, so if anyone wants to point me to the archives of 
> this mailing list to search for my problem, that’s fine.
> 
> My problem is that I have three special characters being placed into 
> formatted text:  return, non-breaking spaces, and soft hyphens.  I can 
> input them as &#x0D;, &#xA0;, and &#xAD;.  The first two Xerces handles
> fine, the third I seem to be getting a standard hyphen???
Have you examined the content of the document to verify this?  I don't 
know of any code in Xerces-C that would translate a soft hyphen to a 
regular hyphen.

> But when I 
> write them out, they go in as non-printing control characters.  Xerces 
> can import those fine, so I can round trip, but, the non-printing 
> characters aren’t too user-friendly.
I'm not sure I understand your question and the problems you're seeing. 
Are you trying to configure the serializer so it generates entities
for certain characters?  If so, there's no way to do that.

> 
> I have defined in my dtd file:
> 
> <!ENTITY return "&#x0D;">
> <!ENTITY nbsp "&#xA0;">
> <!ENTITY softhyphen "&#xAD;">
In general, the DTD is processed by the parser, the entities are 
expanded, and their identities are lost. There is no connection between 
the DTD in the input document, and the document the serializer generates.
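
To illustrate why the entity names cannot be recovered at serialization time, here is a minimal sketch in plain C++ (not Xerces-C code). The helper expandEntities and its lookup table are hypothetical stand-ins for what a parser does with internal entity declarations, and the sketch assumes UTF-8 text:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper, not part of Xerces-C: replaces each "&name;" whose
// name appears in the table with its declared replacement text. Unknown
// references and stray '&' characters are copied through unchanged.
std::string expandEntities(const std::string& text,
                           const std::map<std::string, std::string>& entities) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            const std::size_t semi = text.find(';', i + 1);
            if (semi != std::string::npos) {
                const auto it = entities.find(text.substr(i + 1, semi - i - 1));
                if (it != entities.end()) {
                    out += it->second;  // only the replacement text survives
                    i = semi + 1;
                    continue;
                }
            }
        }
        out += text[i++];
    }
    return out;
}
```

With a table that maps nbsp to the UTF-8 bytes for U+00A0, the input "a&nbsp;b" and the same text containing a literal non-breaking space expand to identical strings, so nothing downstream can tell which form the input used.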

> 
> And tried &return;, etc, that didn’t seem to work at all.
Didn't seem to work in what way?

> I’ve checked DomOptions and looked at DOMSerializer, haven’t seen 
> anything that looks like it would help.
The usual way to handle this is to specify US-ASCII as the encoding. 
Since that encoding only supports characters below 128, all other 
characters will be written as numeric character references.

However, that will not solve the problem with the U+000D, which should 
already be written as a numeric character reference.  If that's not the 
case, the Xerces serializer has a bug.
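
As a rough illustration of the behavior described above, the following plain C++ sketch (not the Xerces-C API) writes text for an ASCII-only target. The function name escapeForAscii is hypothetical, and the exact form of the references (uppercase hex, no zero padding) is an assumption rather than a statement about what Xerces emits:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not part of Xerces-C: converts Unicode code points to
// ASCII-only XML character data. Code points above 0x7F cannot be represented
// in US-ASCII, so they become numeric character references. U+000D is also
// written as a reference so that it survives line-end normalization on re-parse.
std::string escapeForAscii(const std::u32string& text) {
    std::string out;
    for (const char32_t cp : text) {
        if (cp == U'&') {
            out += "&amp;";
        } else if (cp == U'<') {
            out += "&lt;";
        } else if (cp == 0x0D || cp > 0x7F) {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%lX;",
                          static_cast<unsigned long>(cp));
            out += buf;
        } else {
            out += static_cast<char>(cp);  // plain ASCII passes through
        }
    }
    return out;
}
```

Under these assumptions, a non-breaking space, a soft hyphen, and a carriage return would come out as &#xA0;, &#xAD;, and &#xD; respectively, which is readable in a text editor and still parses back to the same characters.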

Dave
