Posted to dev@xalan.apache.org by Gary L Peskin <ga...@firstech.com> on 2000/09/27 08:07:12 UTC

Re: XalanJ2 XMLSerializer problems [1 of 3]

Scott_Boag@lotus.com wrote:
> Myriam should be checking in a change to ResultTreeHandler from the Xalan
> side (so that startDocument is properly called in this case).  But the real
> problem is in the Xerces serializer.  I'm pretty sure a fix, whichever way
> you do it, will require a change to the Xerces serializer.  Remember that
> Xalan1 normally used its own serializers.

Myriam's change did not solve the problem in my example.

I decided to dig into the ResultTreeHandler code and try to understand
it for myself.  In the process, I cleaned up the code a bit and added
some short comments to explain the various fields.  Also, I renamed
m_mustFlushStartDoc, which made no sense to me, to m_haveDocContent.  I
ended up pulling out the newly added code, which wasn't necessary, and
moving some code around to where it made more sense to me and also worked
correctly in all cases.

I've submitted the diffs for Scott and Myriam to look at as [2 of 3] of
this message.  Please review them and let me know what you think.

The trouble does, indeed, lie within XMLSerializer and
BaseMarkupSerializer in org.apache.xml.serialize, part of the Xerces
project.  My initial confusion came from not realizing that there are
two similar but different types of XML declarations:
(1) The XMLDecl (http://www.w3.org/TR/REC-xml#NT-XMLDecl) which appears
at the beginning of a well-formed XML document and (2) the TextDecl
(http://www.w3.org/TR/REC-xml#NT-TextDecl) which appears at the
beginning of a well-formed external general parsed entity or well-formed
external parameter entity.

The XMLDecl and TextDecl look almost the same, except that the XMLDecl
also allows a standalone= pseudoattribute.  Xerces currently only
handles the creation of an XMLDecl, not a TextDecl.  I have composed a
proposed letter to the Xerces people and sent that as [3 of 3] of this
message.  We can discuss changes and then forward it on to the Xerces
people.
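
To make the overlap between the two productions concrete, here is a
minimal Java sketch.  The class and method names (DeclSketch, buildDecl)
are made up for illustration and are not part of Xerces or Xalan.  It
assumes both version and encoding are always written; in that case a
declaration without standalone= matches both XMLDecl and TextDecl, while
one with standalone= is only an XMLDecl.

```java
// Hypothetical helper for illustration only; not Xerces or Xalan code.
class DeclSketch {

    /**
     * Builds a declaration carrying version and encoding.
     * Pass null for standalone to get a string that is valid both as
     * an XMLDecl and as a TextDecl.
     */
    static String buildDecl(String version, String encoding, String standalone) {
        StringBuilder sb = new StringBuilder("<?xml version=\"");
        sb.append(version).append("\" encoding=\"").append(encoding).append('"');
        if (standalone != null) {
            // A standalone declaration is allowed in an XMLDecl only,
            // never in a TextDecl.
            sb.append(" standalone=\"").append(standalone).append('"');
        }
        return sb.append("?>").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildDecl("1.0", "UTF-8", null));   // XMLDecl and TextDecl
        System.out.println(buildDecl("1.0", "UTF-8", "yes"));  // XMLDecl only
    }
}
```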

> (from Scott) Hi Gary.  The issue seems to be with non-whitespace text nodes
> being output on the root.  Both the Xerces serializer classes and the Xalan
> ResultTreeHandler classes have problems with this.  Though it is somewhat
> illegal, I think we should still probably try and handle this case.  One
> thing we could do as a fairly quick, and maybe correct, fix, is to suppress
> output of the XML decl in this case, since the output really isn't legal
> XML.  What do you think?

With respect to XSLT, section 16.1
(http://www.w3.org/TR/xslt#section-XML-Output-Method) is pretty clear on
this, I think:  "The xml output method outputs the result tree as a
well-formed XML external general parsed entity.  If the root node of the
result tree has a single element node child and no text node children,
then the entity should also be a well-formed XML document entity....
The xml output method should output an XML declaration unless the
omit-xml-declaration attribute has the value yes.  The XML declaration
should include both version information and an encoding declaration.  If
the standalone attribute is specified, it should include a standalone
document declaration with the same value as the value of the standalone
attribute. Otherwise, it should not include a standalone
document declaration; this ensures that it is both a XML declaration
(allowed at the beginning of a document entity) and a text declaration
(allowed at the beginning of an external general parsed entity)."

So, what I think is that this is not only _not_ illegal but
specifically allowed by the XSLT spec.  So, I think we need to persuade
the Xerces people to support it if we want to call Xalan compliant.

Gary

What to do with the &

Posted by Eric Advincula <er...@earthlink.net>.
I have some text that I'm combining into an XML file.

_strXML is a string:

        _strXML += "<Description>";
        _strXML += _result.getString( "Description" );
        _strXML += "</Description>";

The problem is that the result from the database returns a string with the
following:

Ex "This is a & text"

and with the '&' I get the following error:

XSL Error: Could not parse input XML document
XSL  Error: SAX Exception
org.apache.xalan.xslt.XSLProcessorException: The entity name must
immediately follow the '&' in the entity reference.

Any way of going around this?

Thanks
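
One common way around this is to escape the characters that are special
in XML before concatenating the database value into the markup.  The
following is a minimal sketch, not a complete solution: the class and
method names (XmlEscape, escapeXml) are made up for illustration, and the
literal string stands in for the _result.getString( "Description" ) call
above.

```java
// Hypothetical helper for illustration only.
class XmlEscape {

    /** Escapes the characters that are special in XML character data. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stands in for the value returned from the database.
        String fromDb = "This is a & text";
        String xml = "<Description>" + escapeXml(fromDb) + "</Description>";
        System.out.println(xml); // <Description>This is a &amp; text</Description>
    }
}
```

Quotes would also need escaping if the value were placed inside an
attribute rather than element content.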



Re: XalanJ2 XMLSerializer problems [3 of 3]

Posted by Gary L Peskin <ga...@firstech.com>.
Gary L Peskin wrote:
> The XMLDecl and TextDecl look almost the same, except that the XMLDecl
> also allows a standalone= pseudoattribute.  Xerces currently only
> handles the creation of an XMLDecl, not a TextDecl.  I have composed a
> proposed letter to the Xerces people and sent that as [3 of 3] of this
> message.  We can discuss changes and then forward it on to the Xerces
> people.

Here is my proposed email to the Xerces people:

We are currently using your org.apache.xml.serialize.XMLSerializer class
and its base class, BaseMarkupSerializer.  We're using the SAX
interfaces.

Unless suppressed by a call to OutputFormat.setOmitXMLDeclaration(true),
the XMLSerializer class will automatically emit an XMLDecl
(http://www.w3.org/TR/REC-xml#NT-XMLDecl) upon encountering the first
startElement() (or serializeElement()) call.  This is fine when we are
generating well-formed XML document entities.

However, we also are attempting to use XMLSerializer to output
well-formed XML external general parsed entities.  In this case, we need
XMLSerializer to emit a TextDecl
(http://www.w3.org/TR/REC-xml#NT-TextDecl) at the beginning of the
output.  It currently does not do this.

Thus, we attempt to output the following well-formed XML external
general parsed entity:

-------------------------------------------------------------------
This is a test
*abc<h1>dummy element content</h1>def*
-------------------------------------------------------------------

We do this by calling:
startDocument()
characters() for "This is a test\n"
characters() for "*abc"
startElement() for <h1>
characters() for "dummy element content"
endElement() for </h1>
characters() for "def*"
endDocument()

The output produced by the XMLSerializer looks like this:
-------------------------------------------------------------------
This is a test
*abc<?xml version="1.0" encoding="UTF-8"?>
<h1>dummy element content</h1>def*
-------------------------------------------------------------------

with the XMLDecl immediately preceding the first element (<h1>) in the
output.

What we need is this:
-------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
This is a test
*abc<h1>dummy element content</h1>def*
-------------------------------------------------------------------

In other words, if the output consists only of comments, processing
instructions, whitespace, and a doctype declaration before the first
element, then continue to output an XMLDecl.  Otherwise, output a
TextDecl at the beginning of the serialized output.
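
The requested behavior can be sketched with a toy SAX handler.  This is
only an illustration under stated assumptions, not the Xerces
implementation or a proposed patch: the class DeclFirstSketch and its
methods are hypothetical, it does no escaping, ignores attributes, and
always writes a declaration without standalone= (which is valid as either
an XMLDecl or a TextDecl).  The point is simply that the declaration is
written before the first output of any kind, not before the first
element.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical toy serializer for illustration only; not Xerces code.
class DeclFirstSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean declWritten = false;

    /** Writes the declaration once, before any other output. */
    private void ensureDecl() {
        if (!declWritten) {
            out.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            declWritten = true;
        }
    }

    @Override public void startDocument() { }
    @Override public void endDocument() { }

    @Override public void characters(char[] ch, int start, int length) {
        ensureDecl();                    // text may come before any element
        out.append(ch, start, length);   // no escaping in this sketch
    }

    @Override public void startElement(String uri, String local,
                                       String qName, Attributes atts) {
        ensureDecl();
        out.append('<').append(qName).append('>');  // attributes ignored
    }

    @Override public void endElement(String uri, String local, String qName) {
        out.append("</").append(qName).append('>');
    }

    private void text(String s) {
        characters(s.toCharArray(), 0, s.length());
    }

    /** Replays the call sequence from the example above. */
    static String demo() {
        DeclFirstSketch h = new DeclFirstSketch();
        h.startDocument();
        h.text("This is a test\n");
        h.text("*abc");
        h.startElement("", "h1", "h1", new AttributesImpl());
        h.text("dummy element content");
        h.endElement("", "h1", "h1");
        h.text("def*");
        h.endDocument();
        return h.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```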

We need this change so that XalanJ2 can be compliant with the XSLT
Recommendation.

Can you please give us your thoughts on this change?  We can supply the
diffs to implement this change for your review if you think that would
be best.

Thanks,