Posted to j-users@xerces.apache.org by Steve Carton <st...@retrievalsystems.com> on 2007/11/06 22:10:45 UTC

Split CDATA Sections and the division Symbol (x00f7)

I'm trying to figure out if this is a bug or not. I created a DOM with
an element with a CDATA section and I set the value to a String of
characters which include a division symbol (xF7). (I actually do this by
reading the characters in from a file and converting them from bytes to
a String specifying a Windows-1252 encoding.) When I serialize this DOM
out to a String, byte array or anything else, the CDATA section is split
around the division symbol and the division symbol is emitted as an
entity (&#xF7;). I do try to serialize this as UTF-8.
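
The byte-level facts in the paragraph above can be checked with a small sketch. It assumes the JDK in use provides the windows-1252 charset (common JDK builds do, but the Java specification does not guarantee it); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DivisionSignBytes {
    /** Decodes a single Windows-1252 byte into its Unicode code point. */
    static int decodeWindows1252(byte b) {
        // Assumption: the windows-1252 charset is available in this JDK.
        String s = new String(new byte[] { b }, Charset.forName("windows-1252"));
        return s.codePointAt(0);
    }

    /** Returns the UTF-8 bytes of a code point as upper-case hex, e.g. "C3 B7". */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int cp = decodeWindows1252((byte) 0xF7);
        System.out.println(String.format("U+%04X %s", cp, Character.getName(cp)));
        System.out.println(utf8Hex(cp));
    }
}
```

Byte 0xF7 decodes to U+00F7 (DIVISION SIGN), and that code point encodes to the two UTF-8 bytes C3 B7, so the character itself is representable in the requested output encoding.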
 
I see in the documentation that this is the correct behavior when the
serializer encounters a Unicode character that isn't recognized; not
sure if this means not recognized in the Unicode (internal) form or
there is no UTF-8 equivalent. But x00F7 seems to be the correct Unicode
value for a division symbol and there is a UTF-8 encoding for it.  Other
"special" characters seem to serialize to UTF-8 without this split. 
 
I can send code. I've tried this on the latest Xerces-J. Anyone have any
thoughts about it?
 
Thanks,
 
Steve Carton

RE: Split CDATA Sections and the division Symbol (x00f7)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

For what it's worth, the deprecated serializer is now fixed [1].

Thanks.

[1] http://marc.info/?l=xerces-cvs&m=119455107025507&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Michael Glavassevich/Toronto/IBM@IBMCA wrote on 11/08/2007 02:33:55 PM:

> Hi Steve,
>
> Do you have serializer.jar (containing the LSSerializer from Xalan) on your
> classpath? I can only reproduce this with Xerces' implementation of
> LSSerializer which I might add is also deprecated.
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org
>
> "Steve Carton" <st...@retrievalsystems.com> wrote on 11/08/2007
> 11:48:16 AM:
>
> > Hi Michael,
> >
> > I've fooled with this in several forms, always with the same
> > results. My current incarnation of the code uses the LSSerializer
> > API. I've also used the (deprecated) XMLSerializer. In either case,
> > I've tried StringWriter, FileWriter, and ByteArrayOutputStream (then
> > to a FileOutputStream to write to a file). I specify UTF-8 as the
> > output encoding. Here's a snippet of the code:
> >
> >       System.setProperty(DOMImplementationRegistry.PROPERTY,
> >           "org.apache.xerces.dom.DOMImplementationSourceImpl");
> >       DOMImplementationRegistry registry =
> >           DOMImplementationRegistry.newInstance();
> >       DOMImplementation domImpl = registry.getDOMImplementation("LS 3.0");
> >       DOMImplementationLS implLS = (DOMImplementationLS)domImpl;
> >       LSSerializer dom3Writer = implLS.createLSSerializer();
> >       LSOutput output=implLS.createLSOutput();
> >       ByteArrayOutputStream bs = new ByteArrayOutputStream();
> >       output.setByteStream(bs);
> >       output.setEncoding("UTF-8");
> >       dom3Writer.write(doc,output);
> >
> > Here's what gets written to a file from that byte stream:
> >
> > <test><div>¦º3 times: ÷ ÷ ÷º¬</div><divCDATA><![CDATA[¦º3 times: ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[º¬]]></divCDATA></test>
> >
> > Note that the serialized element that is *not* a CDATA section
> > converts the division symbol to UTF-8 without a problem.
> >
> > Steve
> >
> > -----Original Message-----
> > From: Michael Glavassevich [mailto:mrglavas@ca.ibm.com]
> > Sent: Wednesday, November 07, 2007 11:04 PM
> > To: j-users@xerces.apache.org
> > Cc: Steve Carton
> > Subject: Re: Split CDATA Sections and the division Symbol (x00f7)
> >
> > Hi Steve,
> >
> > "Steve Carton" <st...@retrievalsystems.com> wrote on 11/06/2007
> > 04:10:45 PM:
> >
> > > I'm trying to figure out if this is a bug or not. I created a DOM with
> > > an element with a CDATA section and I set the value to a String of
> > > characters which include a division symbol (xF7). (I actually do this
> > > by reading the characters in from a file and converting them from
> > > bytes to a String specifying a Windows-1252 encoding.) When I
> > > serialize this DOM out to a String, byte array or anything else, the
> > > CDATA section is split around the division symbol and the division
> > > symbol is emitted as an entity (&#xF7;). I do try to serialize this as
> > > UTF-8.
> >
> > Some questions ...
> >
> > What API are you using for serialization? Are you specifying an
> > output encoding? What type of output are you writing to? A
> > java.io.OutputStream? A java.io.Writer?
> >
> > > I see in the documentation that this is the correct behavior when the
> > > serializer encounters a Unicode character that isn't recognized; not
> > > sure if this means not recognized in the Unicode (internal) form or
> > > there is no UTF-8 equivalent. But x00F7 seems to be the correct
> > > Unicode value for a division symbol and there is a UTF-8 encoding for
> > > it.  Other "special" characters seem to serialize to UTF-8 without
> > > this split.
> >
> > I think what you meant to say here is "not expressible in the output
> > encoding". For instance ASCII is only capable of representing
> > Unicode code points from 0x00-0x7F. 0xF7 isn't representable in ASCII.
> >
> > > I can send code. I've tried this on the latest Xerces-J. Anyone have
> > > any thoughts about it?
> > >
> > > Thanks,
> > >
> > > Steve Carton
> >
> > Thanks.
> >
> > Michael Glavassevich
> > XML Parser Development
> > IBM Toronto Lab
> > E-mail: mrglavas@ca.ibm.com
> > E-mail: mrglavas@apache.org
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
> > For additional commands, e-mail: j-users-help@xerces.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org

RE: Split CDATA Sections and the division Symbol (x00f7)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Steve,

Do you have serializer.jar (containing the LSSerializer from Xalan) on your
classpath? I can only reproduce this with Xerces' implementation of
LSSerializer which I might add is also deprecated.
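
One way to answer the classpath question is to ask the running JVM which LSSerializer class it actually hands out and where that class was loaded from. The following is a minimal sketch using only standard JDK and DOM Level 3 bootstrap calls; the class name `WhichSerializer` and its helper methods are hypothetical.

```java
import java.security.CodeSource;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class WhichSerializer {
    /** Creates an LSSerializer through the DOM Level 3 bootstrap registry. */
    static LSSerializer newSerializer() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls = (DOMImplementationLS) registry.getDOMImplementation("LS 3.0");
            return ls.createLSSerializer();
        } catch (Exception e) {
            throw new IllegalStateException("Could not bootstrap a DOM LS implementation", e);
        }
    }

    /** Describes the concrete class of an object and the jar or module it came from. */
    static String describe(Object o) {
        CodeSource source = o.getClass().getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(no code source, probably the runtime itself)"
                : source.getLocation().toString();
        return o.getClass().getName() + " from " + location;
    }

    public static void main(String[] args) {
        // Prints the implementation class name and its origin.
        System.out.println(describe(newSerializer()));
    }
}
```

If the printed class lives in an `org.apache.xerces` package, the Xerces serializer is in use; a class from the Xalan serializer.jar would show that jar as its location instead.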

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

"Steve Carton" <st...@retrievalsystems.com> wrote on 11/08/2007
11:48:16 AM:

> Hi Michael,
>
> I've fooled with this in several forms, always with the same
> results. My current incarnation of the code uses the LSSerializer
> API. I've also used the (deprecated) XMLSerializer. In either case,
> I've tried StringWriter, FileWriter, and ByteArrayOutputStream (then
> to a FileOutputStream to write to a file). I specify UTF-8 as the
> output encoding. Here's a snippet of the code:
>
>       System.setProperty(DOMImplementationRegistry.PROPERTY,
>           "org.apache.xerces.dom.DOMImplementationSourceImpl");
>       DOMImplementationRegistry registry =
>           DOMImplementationRegistry.newInstance();
>       DOMImplementation domImpl = registry.getDOMImplementation("LS 3.0");
>       DOMImplementationLS implLS = (DOMImplementationLS)domImpl;
>       LSSerializer dom3Writer = implLS.createLSSerializer();
>       LSOutput output=implLS.createLSOutput();
>       ByteArrayOutputStream bs = new ByteArrayOutputStream();
>       output.setByteStream(bs);
>       output.setEncoding("UTF-8");
>       dom3Writer.write(doc,output);
>
> Here's what gets written to a file from that byte stream:
>
> <test><div>¦º3 times: ÷ ÷ ÷º¬</div><divCDATA><![CDATA[¦º3 times: ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[º¬]]></divCDATA></test>
>
> Note that the serialized element that is *not* a CDATA section
> converts the division symbol to UTF-8 without a problem.
>
> Steve
>
> -----Original Message-----
> From: Michael Glavassevich [mailto:mrglavas@ca.ibm.com]
> Sent: Wednesday, November 07, 2007 11:04 PM
> To: j-users@xerces.apache.org
> Cc: Steve Carton
> Subject: Re: Split CDATA Sections and the division Symbol (x00f7)
>
> Hi Steve,
>
> "Steve Carton" <st...@retrievalsystems.com> wrote on 11/06/2007
> 04:10:45 PM:
>
> > I'm trying to figure out if this is a bug or not. I created a DOM with
> > an element with a CDATA section and I set the value to a String of
> > characters which include a division symbol (xF7). (I actually do this
> > by reading the characters in from a file and converting them from
> > bytes to a String specifying a Windows-1252 encoding.) When I
> > serialize this DOM out to a String, byte array or anything else, the
> > CDATA section is split around the division symbol and the division
> > symbol is emitted as an entity (&#xF7;). I do try to serialize this as
> > UTF-8.
>
> Some questions ...
>
> What API are you using for serialization? Are you specifying an
> output encoding? What type of output are you writing to? A
> java.io.OutputStream? A java.io.Writer?
>
> > I see in the documentation that this is the correct behavior when the
> > serializer encounters a Unicode character that isn't recognized; not
> > sure if this means not recognized in the Unicode (internal) form or
> > there is no UTF-8 equivalent. But x00F7 seems to be the correct
> > Unicode value for a division symbol and there is a UTF-8 encoding for
> > it.  Other "special" characters seem to serialize to UTF-8 without
> > this split.
>
> I think what you meant to say here is "not expressible in the output
> encoding". For instance ASCII is only capable of representing
> Unicode code points from 0x00-0x7F. 0xF7 isn't representable in ASCII.
>
> > I can send code. I've tried this on the latest Xerces-J. Anyone have
> > any thoughts about it?
> >
> > Thanks,
> >
> > Steve Carton
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: mrglavas@ca.ibm.com
> E-mail: mrglavas@apache.org



RE: Split CDATA Sections and the division Symbol (x00f7)

Posted by Steve Carton <st...@retrievalsystems.com>.

Hi Michael,

I've fooled with this in several forms, always with the same results. My current incarnation of the code uses the LSSerializer API. I've also used the (deprecated) XMLSerializer. In either case, I've tried StringWriter, FileWriter, and ByteArrayOutputStream (then to a FileOutputStream to write to a file). I specify UTF-8 as the output encoding. Here's a snippet of the code:

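		// Imports: java.io.ByteArrayOutputStream, org.w3c.dom.DOMImplementation,
		// org.w3c.dom.bootstrap.DOMImplementationRegistry, org.w3c.dom.ls.*
		// Select the Xerces DOM implementation source for the DOM Level 3 bootstrap.
		System.setProperty(DOMImplementationRegistry.PROPERTY, "org.apache.xerces.dom.DOMImplementationSourceImpl");
		DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
		DOMImplementation domImpl = registry.getDOMImplementation("LS 3.0");
		DOMImplementationLS implLS = (DOMImplementationLS) domImpl;
		LSSerializer dom3Writer = implLS.createLSSerializer();
		LSOutput output = implLS.createLSOutput();
		// Serialize into an in-memory byte stream, requesting UTF-8 output.
		ByteArrayOutputStream bs = new ByteArrayOutputStream();
		output.setByteStream(bs);
		output.setEncoding("UTF-8");
		// doc is the org.w3c.dom.Document built earlier (not shown in this post).
		dom3Writer.write(doc, output);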

Here's what gets written to a file from that byte stream:

<test><div>¦º3 times: ÷ ÷ ÷º¬</div><divCDATA><![CDATA[¦º3 times: ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[º¬]]></divCDATA></test>

Note that the serialized element that is *not* a CDATA section converts the division symbol to UTF-8 without a problem.
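
For anyone who wants to try this without the rest of the application, here is a self-contained sketch of the same experiment. It is not the original program: it obtains the Document through JAXP's `DocumentBuilderFactory` rather than the registry call in the snippet above, so it runs against whatever parser and serializer are first on the classpath, and the class name, helper methods and sample text are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

class CdataRoundTrip {
    /** Sample text containing U+00F7 (DIVISION SIGN) three times. */
    static final String SAMPLE = "3 times: \u00F7 \u00F7 \u00F7";

    /** Builds test/div (text node) and test/divCDATA (CDATA section), then serializes as UTF-8. */
    static byte[] serialize(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("test");
            doc.appendChild(root);

            Element div = doc.createElement("div");
            div.appendChild(doc.createTextNode(text));
            root.appendChild(div);

            Element divCdata = doc.createElement("divCDATA");
            divCdata.appendChild(doc.createCDATASection(text));
            root.appendChild(divCdata);

            DOMImplementationLS ls =
                    (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
            LSSerializer serializer = ls.createLSSerializer();
            LSOutput output = ls.createLSOutput();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            output.setByteStream(bytes);
            output.setEncoding("UTF-8");
            serializer.write(doc, output);
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    /** Parses the serialized bytes and returns the text content of the first matching element. */
    static String textOf(byte[] xml, String elementName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getElementsByTagName(elementName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = serialize(SAMPLE);
        // Inspect the raw output to see whether the CDATA section was split.
        System.out.println(new String(xml, StandardCharsets.UTF_8));
        System.out.println("CDATA text survives round trip: "
                + SAMPLE.equals(textOf(xml, "divCDATA")));
    }
}
```

Whether the CDATA section comes out split depends on the serializer implementation that gets picked up. Either form should parse back to the same text, since a character reference between two CDATA sections contributes the same character to the element's content.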

Steve

-----Original Message-----
From: Michael Glavassevich [mailto:mrglavas@ca.ibm.com] 
Sent: Wednesday, November 07, 2007 11:04 PM
To: j-users@xerces.apache.org
Cc: Steve Carton
Subject: Re: Split CDATA Sections and the division Symbol (x00f7)

Hi Steve,

"Steve Carton" <st...@retrievalsystems.com> wrote on 11/06/2007
04:10:45 PM:

> I'm trying to figure out if this is a bug or not. I created a DOM with 
> an element with a CDATA section and I set the value to a String of 
> characters which include a division symbol (xF7). (I actually do this 
> by reading the characters in from a file and converting them from 
> bytes to a String specifying a Windows-1252 encoding.) When I 
> serialize this DOM out to a String, byte array or anything else, the 
> CDATA section is split around the division symbol and the division
> symbol is emitted as an entity (&#xF7;). I do try to serialize this as
> UTF-8.

Some questions ...

What API are you using for serialization? Are you specifying an output encoding? What type of output are you writing to? A java.io.OutputStream? A java.io.Writer?

> I see in the documentation that this is the correct behavior when the 
> serializer encounters a Unicode character that isn't recognized; not 
> sure if this means not recognized in the Unicode (internal) form or 
> there is no UTF-8 equivalent. But x00F7 seems to be the correct 
> Unicode value for a division symbol and there is a UTF-8 encoding for 
> it.  Other "special" characters seem to serialize to UTF-8 without 
> this split.

I think what you meant to say here is "not expressible in the output encoding". For instance ASCII is only capable of representing Unicode code points from 0x00-0x7F. 0xF7 isn't representable in ASCII.

> I can send code. I've tried this on the latest Xerces-J. Anyone have 
> any thoughts about it?
>
> Thanks,
>
> Steve Carton

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org



Re: Split CDATA Sections and the division Symbol (x00f7)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Steve,

"Steve Carton" <st...@retrievalsystems.com> wrote on 11/06/2007
04:10:45 PM:

> I'm trying to figure out if this is a bug or not. I created a DOM
> with an element with a CDATA section and I set the value to a String
> of characters which include a division symbol (xF7). (I actually do
> this by reading the characters in from a file and converting them
> from bytes to a String specifying a Windows-1252 encoding.) When I
> serialize this DOM out to a String, byte array or anything else, the
> CDATA section is split around the division symbol and the division
> symbol is emitted as an entity (&#xF7;). I do try to serialize this as
> UTF-8.

Some questions ...

What API are you using for serialization? Are you specifying an output
encoding? What type of output are you writing to? A java.io.OutputStream? A
java.io.Writer?

> I see in the documentation that this is the correct behavior when
> the serializer encounters a Unicode character that isn't recognized;
> not sure if this means not recognized in the Unicode (internal) form
> or there is no UTF-8 equivalent. But x00F7 seems to be the correct
> Unicode value for a division symbol and there is a UTF-8 encoding
> for it.  Other "special" characters seem to serialize to UTF-8
> without this split.

I think what you meant to say here is "not expressible in the output
encoding". For instance ASCII is only capable of representing Unicode code
points from 0x00-0x7F. 0xF7 isn't representable in ASCII.
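
The "expressible in the output encoding" test can be sketched with the JDK's `CharsetEncoder`. This illustrates the rule only and is not the serializer's actual code; the class and method names are hypothetical. Character references are not recognized inside a CDATA section, which is why a serializer that judges a character unencodable has to close the section, emit the reference, and open a new section.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodableCheck {
    /** True if the charset can represent the character directly. */
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    /** The character itself if encodable, otherwise a numeric character reference. */
    static String asText(Charset charset, char c) {
        return canEncode(charset, c)
                ? String.valueOf(c)
                : "&#x" + Integer.toHexString(c) + ";";
    }

    public static void main(String[] args) {
        char division = '\u00F7';
        // US-ASCII only covers 0x00-0x7F, so a reference is needed.
        System.out.println(asText(StandardCharsets.US_ASCII, division));
        // UTF-8 covers all of Unicode, so the character can be written as-is.
        System.out.println(asText(StandardCharsets.UTF_8, division));
    }
}
```

With US-ASCII the helper returns `&#xf7;`; with UTF-8 it returns the character unchanged, which is the behavior Steve expected for his UTF-8 output.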

> I can send code. I've tried this on the latest Xerces-J. Anyone have
> any thoughts about it?
>
> Thanks,
>
> Steve Carton

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org



Re: Split CDATA Sections and the division Symbol (x00f7)

Posted by ke...@us.ibm.com.

>I don't see 0xF7 in that list.

You're right. Apparently I've still got a bit of dyslexia; I managed to
misread #x7F as #xF7. Sorry about the confusion.

("Caution: To avoid damage to reputation, engage brain before putting
fingers in gear.")

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: Split CDATA Sections and the division Symbol (x00f7)

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Hi Joe,

I don't see 0xF7 in that list. Checking the Unicode code charts it's
defined to be the division sign [1] (which is what Steve said it was in his
post).
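
The list in question can be checked mechanically. The sketch below hard-codes the discouraged ranges as given in the note in section 2.2 of the XML 1.0 Recommendation (the XML 1.1 list differs slightly and is not covered); the class and method names are made up for illustration.

```java
class XmlDiscouraged {
    /** True if the code point falls in a range XML 1.0 section 2.2 lists as discouraged. */
    static boolean isDiscouragedXml10(int cp) {
        if (cp >= 0x7F && cp <= 0x84) {
            return true;
        }
        if (cp >= 0x86 && cp <= 0x9F) {
            return true;
        }
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return true;
        }
        // Last two code points of each supplementary plane: U+1FFFE, U+1FFFF, ..., U+10FFFF.
        return cp >= 0x10000 && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x7F, 0xF7 }) {
            System.out.println(String.format("U+%04X %s discouraged=%b",
                    cp, Character.getName(cp), isDiscouragedXml10(cp)));
        }
    }
}
```

U+007F (DELETE) is in the first range and is reported as discouraged, while U+00F7 (DIVISION SIGN) is not in any of them.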

Thanks.

[1] http://www.unicode.org/charts/PDF/U0080.pdf

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

keshlam@us.ibm.com wrote on 11/07/2007 09:43:18 AM:

> For what it's worth, 0xF7 is one of the characters which both the
> XML 1.0 and 1.1 recommendations suggest should be avoided by
> document authors. "They are either control characters or permanently
> undefined Unicode characters" (http://www.w3.org/TR/REC-xml/#charsets)
>
> ______________________________________
> "... Three things see no end: A loop with exit code done wrong,
> A semaphore untested, And the change that comes along. ..."
> -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
> (http://www.ovff.org/pegasus/songs/threes-rev-11.html)



Re: Split CDATA Sections and the division Symbol (x00f7)

Posted by ke...@us.ibm.com.

For what it's worth, 0xF7 is one of the characters which both the XML 1.0
and 1.1 recommendations suggest should be avoided by document authors.
"They are either control characters or permanently undefined Unicode
characters" (http://www.w3.org/TR/REC-xml/#charsets)

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)