Posted to c-dev@xerces.apache.org by Kurt Sorge <ks...@touchnet.com> on 2000/12/04 16:55:53 UTC

using preformatted text

I have an XML file in which one of the nodes holds preformatted text for an
ASCII file. Some of the characters in this node are form feed characters
(0x0C), which the Xerces parser rejects. I tried putting the data in a
<![CDATA[...]]> block, thinking that the parser would pass over it
unparsed, but that didn't work. Another possibility is to replace all the
form feed characters with an <ff/> tag.

Is there a way to keep the data unchanged, or will I have to live with a
hack?

Kurt



Re: using preformatted text

Posted by Kurt Sorge <ks...@touchnet.com>.
The application is writing the <text/> area to a file to be picked up by a
fax server. I don't think the server will recognize a Unicode FF, and I am
fairly sure from the samples I have seen that the "]]>" character string
will not appear in the data. I don't foresee the data being transformed into
anything other than ASCII text.
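
A minimal sketch of the read side, assuming the <text> layout proposed in this thread (CDATA sections separated by <ff/> elements). It uses Python's standard xml.dom.minidom rather than Xerces, and the function name text_element_to_ascii is hypothetical. It rebuilds the original byte stream, form feeds included, for writing to the fax server's file:

```python
from xml.dom import minidom


def text_element_to_ascii(document):
    """Rebuild the ASCII payload from a <text> element (hypothetical helper)."""
    root = minidom.parseString(document).documentElement
    parts = []
    for node in root.childNodes:
        if node.nodeType == node.CDATA_SECTION_NODE:
            # CDATA content is passed through untouched.
            parts.append(node.data)
        elif node.nodeType == node.ELEMENT_NODE and node.tagName == "ff":
            # Each <ff/> stands for one form feed in the original text.
            parts.append("\x0c")
        # Plain text nodes (the line breaks between sections) are skipped.
    # encode() raises UnicodeEncodeError if non-ASCII data slipped in.
    return "".join(parts).encode("ascii")


sample = (
    "<text>\n"
    "<![CDATA[The first page of text]]>\n"
    "<ff/>\n"
    "<![CDATA[The next page of text]]>\n"
    "</text>"
)
payload = text_element_to_ascii(sample)
```

The line breaks between the CDATA sections are treated as layout and dropped, so only the CDATA content and the form feeds reach the output.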

I appreciate all the help I have received on the subject.

Kurt

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org




Re: using preformatted text

Posted by Bill Schindler <de...@bitranch.com>.
"Kurt Sorge" <ks...@touchnet.com> wrote:
> <text>
> <![CDATA[The first page of text]]>
> <ff/>
> <![CDATA[The next page of text]]>
> </text>
> 
> Does anyone see any problems with this?

As long as the text doesn't contain "]]>" (the CDATA end), there shouldn't
be any problems. If that rather unlikely sequence appears, then you'll have
to split it up into two CDATA sections.
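
The split described above can be sketched as follows. This is an illustration in Python, not anything from Xerces, and the helper name to_cdata is hypothetical. Each "]]>" in the data is broken so that "]]" ends one CDATA section and ">" starts the next:

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    # "]]" closes the current section and ">" opens the next one, so the
    # terminator never appears inside a single section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


wrapped = to_cdata("if (a[b[0]]> 1) { ... }")
# Parsing the result gives back the original string, "]]>" included.
restored = ET.fromstring("<text>" + wrapped + "</text>").text
```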

If you foresee any need to transform the data to some other format at a
later date, it might be worth considering whether you should wrap each page
as an element. (It might slightly simplify writing a transform [XSLT] to
convert to HTML, XSL-FO, RTF, or whatever.) In other words, instead of what
you have above, you might use:

 <text>
  <page>
    <!-- first page of text -->
  </page>
  <page>
    <!-- next page of text -->
  </page>
 </text>
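
A rough sketch of the page-per-element layout, assuming the raw text uses form feeds as page separators. It uses Python's standard xml.etree.ElementTree; the function names pages_to_xml and xml_to_pages are hypothetical:

```python
import xml.etree.ElementTree as ET


def pages_to_xml(raw_text):
    """Split on form feeds and wrap each page in a <page> element."""
    root = ET.Element("text")
    for page in raw_text.split("\x0c"):
        # ElementTree escapes &, < and > in the page text when serializing.
        ET.SubElement(root, "page").text = page
    return ET.tostring(root, encoding="unicode")


def xml_to_pages(document):
    """Join the <page> elements back together with form feeds."""
    root = ET.fromstring(document)
    return "\x0c".join(page.text or "" for page in root.findall("page"))


raw = "first page\nline with <markup> & more\x0csecond page"
document = pages_to_xml(raw)
round_trip = xml_to_pages(document)
```

One caveat with any XML round trip: parsers normalize CR LF line endings to LF, so text that depends on carriage returns would need separate handling.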


--Bill

Re: using preformatted text

Posted by Kurt Sorge <ks...@touchnet.com>.
I think I should do both. I will put the main data inside <![CDATA[...]]>
sections and replace the FF characters with <ff/> tags. I think the XML will
look like this:

<text>
<![CDATA[The first page of text]]>
<ff/>
<![CDATA[The next page of text]]>
</text>

Does anyone see any problems with this?

Kurt
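
For illustration, the layout above could be generated along these lines. This is a sketch in Python with the standard xml.dom.minidom, not Xerces code; the function name build_text_element is hypothetical, and it assumes the data never contains "]]>" (minidom raises ValueError on serialization if it does):

```python
from xml.dom.minidom import getDOMImplementation


def build_text_element(raw_text):
    """Emit one CDATA section per page, with <ff/> between pages."""
    doc = getDOMImplementation().createDocument(None, "text", None)
    root = doc.documentElement
    for index, page in enumerate(raw_text.split("\x0c")):
        if index > 0:
            # Every form feed in the input becomes one empty <ff/> element.
            root.appendChild(doc.createElement("ff"))
        root.appendChild(doc.createCDATASection(page))
    return root.toxml()


xml_out = build_text_element("The first page of text\x0cThe next page of text")
```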




Re: using preformatted text

Posted by Bill Schindler <de...@bitranch.com>.
"Kurt Sorge" <ks...@touchnet.com> wrote:
> Some of the characters in this node are form feed characters
> (0x0C). The Xerces parser doesn't recognize this character.

No XML parser should allow the 0x0C character since XML 1.0 defines it as
an illegal character. (See the XML spec, section 2.2 --
http://www.w3.org/TR/2000/REC-xml-20001006)
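
The Char production in section 2.2 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A small sketch of that check (Python, with hypothetical function names) can flag offending characters before the data ever reaches a parser:

```python
def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production (section 2.2)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_offsets(text):
    """Offsets of every character that XML 1.0 does not allow."""
    return [i for i, ch in enumerate(text) if not is_xml10_char(ch)]


offsets = illegal_offsets("page one\x0cpage two")
```

Tab, line feed and carriage return pass the check; form feed (0x0C) and the other C0 control characters do not.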

> Another possibility is to replace
> all the form feed characters with an <ff/> tag.

That seems to me to be the best work-around.

"Jesse Pelton" <js...@PKC.com> wrote:
> If it were me, I'd encode it as #xC ...

Except that &#x0C; still inserts a 0x0C character, which is an illegal
character no matter how it gets into the text. (Parsing a document
containing &#x0C; shows that Xerces agrees with that.)
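
The same behaviour can be reproduced outside Xerces. The sketch below uses Python's standard ElementTree parser (built on expat) as a stand-in, so it shows what one other conforming parser does rather than what Xerces itself reports:

```python
import xml.etree.ElementTree as ET

# Four ways of getting a page break into the document.
candidates = {
    "raw form feed": "<text>page one\x0cpage two</text>",
    "inside CDATA": "<text><![CDATA[page one\x0cpage two]]></text>",
    "character reference": "<text>page one&#x0C;page two</text>",
    "empty element": "<text>page one<ff/>page two</text>",
}


def parses(document):
    """Return True if the document is accepted as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


results = {label: parses(doc) for label, doc in candidates.items()}
for label, accepted in results.items():
    print(f"{label}: {'accepted' if accepted else 'rejected'}")
```

Only the variant that replaces the form feed with an element is accepted; CDATA and character references do not make the character legal.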

A quick search through the Unicode database doesn't turn up a formfeed
(other than the control character). I suppose you could use some other
Unicode symbol and then special case for it when/if you need to convert back
to the original text.
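
A sketch of that substitution idea, in Python. The placeholder is an assumption, not something chosen in this thread: U+240C (SYMBOL FOR FORM FEED, from the Control Pictures block) is one possible stand-in, since it is a visible symbol and a legal XML character. The function names are hypothetical:

```python
FORM_FEED = "\x0c"
PLACEHOLDER = "\u240c"  # SYMBOL FOR FORM FEED; an illustrative choice only


def encode_form_feeds(text):
    """Swap each form feed for the placeholder before writing the XML."""
    if PLACEHOLDER in text:
        # The round trip would be ambiguous if the symbol already occurs.
        raise ValueError("placeholder already occurs in the text")
    return text.replace(FORM_FEED, PLACEHOLDER)


def decode_form_feeds(text):
    """Special-case the placeholder when converting back to the original."""
    return text.replace(PLACEHOLDER, FORM_FEED)


original = "page one\x0cpage two"
encoded = encode_form_feeds(original)
decoded = decode_form_feeds(encoded)
```

The encoded text is no longer pure ASCII, so the XML file would need an encoding such as UTF-8 that can carry the placeholder.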


--Bill