Posted to j-users@xalan.apache.org by Mike Strauch <mi...@hannonhill.com> on 2008/04/16 21:43:23 UTC
Identity transformation producing invalid XML
Hello,
I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the
following issues:
I'm using a TransformerIdentityImpl to transform the HTML below, and the
result includes a lot of information that I believe is coming from the
DTD associated with the doctype. I'm not sure why it is being
included. This is not my only concern, though. When I attempt to
validate the result as XML I receive the following error:
"White space is required after "<!ENTITY" in the entity declaration."
I am setting the following output properties on the transformer itself:
omit-xml-declaration: no
standalone: no
method: xml
I have fiddled around with these output properties and have not been
able to acquire a result that does not include all of the doctype
information. Is there something I am missing?
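
For readers following along, the setup described above might look roughly like the following. This is a minimal sketch using the standard JAXP API; the class and method names are illustrative and are not taken from the original code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, for illustration only.
final class IdentityTransformSketch {

    /** Copies the given XML text through the JAXP identity transformer. */
    static String identityTransform(String xml) {
        try {
            // The zero-argument newTransformer() returns an identity transformer.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();

            // The three output properties listed in the message above.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.STANDALONE, "no");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Identity transformation failed", e);
        }
    }
}
```

Which implementation class is returned depends on the TransformerFactory found on the classpath, so the exact serialized output may differ between processors and versions.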
Original:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test of lazy DOM builder</title>
</head>
<body>
<p>Some text here is ok</p>
</body>
</html>
Transformed:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [\r\n<!ENTITY
%HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" >\r\n<!ENTITY
%HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN"
>\r\n<!ENTITY %HTMLspecial PUBLIC "-//W3C//ENTITIES Special for
XHTML//EN" >\r\n<!--================== Imported Names
====================================--><!-- media type, as per [RFC2045]
--><!-- comma-separated list of media types, as per [RFC2045] --><!-- a
character encoding, as per [RFC2045] --><!-- a space separated list of
character encodings, as per [RFC2045] --><!-- a language code, as per
[RFC3066] --><!-- a single character, as per section 2.2 of [XML]
--><!-- one or more digits --><!-- space-separated list of link types
--><!-- single or comma-separated list of media descriptors --><!-- a
Uniform Resource Identifier, see [RFC2396] --><!-- a space separated
list of Uniform Resource Identifiers --><!-- date and time information.
ISO date format --><!-- script expression --><!-- style sheet data
--><!-- used for titles etc. --><!-- render in this frame --><!-- nn for
pixels or nn% for percentage length --><!-- pixel, percentage, or
relative --><!-- integer representing length in pixels --><!-- these are
used for image maps --><!-- comma separated list of lengths --><!-- used
for object, applet, img, input and iframe --><!-- a color using sRGB:
#RRGGBB as Hex values --><!-- There are also 16 widely known color names
with their sRGB values:\n\n Black = #000000 Green = #008000\n
Silver = #C0C0C0 Lime = #00FF00\n Gray = #808080 Olive =
#808000\n White = #FFFFFF Yellow = #FFFF00\n Maroon =
#800000 Navy = #000080\n Red = #FF0000 Blue =
#0000FF\n Purple = #800080 Teal = #008080\n Fuchsia=
#FF00FF Aqua = #00FFFF\n--><!--=================== Generic
Attributes ===============================--><!-- core attributes common
to most elements\n id document-wide unique id\n class space
separated list of classes\n style associated style info\n title
advisory title/amplification\n--><!-- internationalization attributes\n
lang language code (backwards compatible)\n xml:lang language
code (as per XML 1.0 spec)\n dir direction for weak/neutral
text\n--><!-- attributes for common UI events\n onclick a pointer
button was clicked\n ondblclick a pointer button was double clicked\n
onmousedown a pointer button was pressed down\n onmouseup a pointer
button was released\n onmousemove a pointer was moved onto the
element\n onmouseout a pointer was moved away from the element\n
onkeypress a key was pressed and released\n onkeydown a key was
pressed down\n onkeyup a key was released\n--><!-- attributes for
elements that can get the focus\n accesskey accessibility key
character\n tabindex position in tabbing order\n onfocus the
element got the focus\n onblur the element lost the focus\n--><!--
text alignment for p, div, h1-h6. The default is\n align="left" for
ltr headings, "right" for rtl --><!--=================== Text Elements
====================================--><!-- these can occur at block or
inline level --><!-- these can only occur at block level --><!--
%Inline; covers inline or "text-level" elements
--><!--================== Block level elements
==============================--><!-- %Flow; mixes block and inline and
is used for list items etc. --><!--================== Content models for
exclusions =====================--><!-- a elements use %Inline;
excluding a --><!-- pre uses %Inline excluding img, object, applet, big,
small,\n font, or basefont --><!-- form uses %Flow; excluding form
--><!-- button uses %Flow; but excludes a, form, form controls, iframe
--><!--================ Document Structure
==================================--><!-- the namespace URI designates
the document profile --><!--================ Document Head
=======================================--><!-- content model is
%head.misc; combined with a single\n title and an optional base
element in any order --><!-- The title element is not considered part of
the flow of text.\n It should be displayed, for example as the
page header or\n window title. Exactly one title is required per
document.\n --><!-- document base URI --><!-- generic metainformation
--><!--\n Relationship values can be used in principle:\n\n a) for
document specific toolbars/menus when used\n with the link element
in document head e.g.\n start, contents, previous, next, index,
end, help\n b) to link to a separate style sheet
(rel="stylesheet")\n c) to make a link to a script (rel="script")\n
d) by stylesheets to control how collections of\n html nodes are
rendered into printed documents\n e) to make a link to a printable
version of this document\n e.g. a PostScript or PDF version
(rel="alternate" media="print")\n--><!-- style info, which may include
CDATA sections --><!-- script statements, which may include CDATA
sections --><!-- alternate content container for non script-based
rendering --><!--======================= Frames
=======================================--><!-- inline subwindow --><!--
alternate content container for non frame-based rendering
--><!--=================== Document Body
====================================--><!-- generic language/style
container --><!--=================== Paragraphs
=======================================--><!--===================
Headings =========================================--><!--\n There are
six levels of headings from h1 (the most important)\n to h6 (the least
important).\n--><!--=================== Lists
============================================--><!-- Unordered list
bullet styles --><!-- Unordered list --><!-- Ordered list numbering
style\n\n 1 arabic numbers 1, 2, 3, ...\n a lower
alpha a, b, c, ...\n A upper alpha A, B, C,
...\n i lower roman i, ii, iii, ...\n I upper
roman I, II, III, ...\n\n The style is applied to the
sequence number which by default\n is reset to 1 for the first list
item in an ordered list.\n--><!-- Ordered (numbered) list --><!-- single
column list (DEPRECATED) --><!-- multiple column list (DEPRECATED)
--><!-- LIStyle is constrained to: "(%ULStyle;|%OLStyle;)" --><!-- list
item --><!-- definition lists - dt for term, dd for its definition
--><!--=================== Address
==========================================--><!-- information on author
--><!--=================== Horizontal Rule
==================================--><!--===================
Preformatted Text ================================--><!-- content is
%Inline; excluding \n
"img|object|applet|big|small|sub|sup|font|basefont"
--><!--=================== Block-like Quotes
================================--><!--=================== Text
alignment ===================================--><!-- center content
--><!--=================== Inserted/Deleted Text
============================--><!--\n ins/del are allowed in block and
inline content, but its\n inappropriate to include block content within
an ins element\n occurring in inline
content.\n--><!--================== The Anchor Element
================================--><!-- content is %Inline; except that
anchors shouldn't be nested --><!--===================== Inline Elements
================================--><!-- generic language/style container
--><!-- I18N BiDi over-ride --><!-- forced line break --><!-- emphasis
--><!-- strong emphasis --><!-- definitional --><!-- program code
--><!-- sample --><!-- something user would type --><!-- variable
--><!-- citation --><!-- abbreviation --><!-- acronym --><!-- inlined
quote --><!-- subscript --><!-- superscript --><!-- fixed pitch font
--><!-- italic font --><!-- bold font --><!-- bigger font --><!--
smaller font --><!-- underline --><!-- strike-through --><!--
strike-through --><!-- base font size --><!-- local change to font
--><!--==================== Object
======================================--><!--\n object is used to embed
objects as part of HTML pages.\n param elements should precede other
content. Parameters\n can also be expressed as attribute/value pairs on
the\n object element itself when brevity is desired.\n--><!--\n param
is used to supply a named property value.\n In XML it would seem
natural to follow RDF and support an\n abbreviated syntax where the
param elements are replaced\n by attribute value pairs on the object
start tag.\n--><!--=================== Java applet
==================================--><!--\n One of code or object
attributes must be present.\n Place param elements before other
content.\n--><!--=================== Images
===========================================--><!--\n To avoid
accessibility problems for people who aren't\n able to see the image,
you should provide a text\n description using the alt and longdesc
attributes.\n In addition, avoid the use of server-side image
maps.\n--><!-- usemap points to a map element which may be in this
document\n or an external document, although the latter is not widely
supported --><!--================== Client-side image maps
============================--><!-- These can be placed in the same
document or grouped in a\n separate document although this isn't yet
widely supported --><!--================ Forms
===============================================--><!-- forms shouldn't
be nested --><!--\n Each label must not contain more than ONE field\n
Label elements shouldn't be nested.\n--><!-- the name attribute is
required for all but submit & reset --><!-- form control --><!-- option
selector --><!-- option group --><!-- selectable choice --><!--
multi-line text field --><!--\n The fieldset element is used to group
form fields.\n Only one legend element should occur in the content\n
and if present should only be preceded by whitespace.\n--><!-- fieldset
label --><!--\n Content is %Flow; excluding a, form, form controls,
iframe\n--><!-- push button --><!-- single-line text input control
(DEPRECATED) --><!--======================= Tables
=======================================--><!-- Derived from IETF HTML
table standard, see [RFC1942] --><!--\n The border attribute sets the
thickness of the frame around the\n table. The default units are screen
pixels.\n\n The frame attribute specifies which parts of the frame
around\n the table should be rendered. The values are not the same as\n
CALS to avoid a name clash with the valign attribute.\n--><!--\n The
rules attribute defines which rules to draw between cells:\n\n If rules
is absent then assume:\n "none" if border is absent or border="0"
otherwise "all"\n--><!-- horizontal placement of table relative to
document --><!-- horizontal alignment attributes for cell contents\n\n
char alignment char, e.g. char=':'\n charoff offset for
alignment char\n--><!-- vertical alignment attributes for cell contents
--><!--\ncolgroup groups a set of col elements. It allows you to
group\nseveral semantically related columns together.\n--><!--\n col
elements define the alignment properties for cells in\n one or more
columns.\n\n The width attribute specifies the width of the columns,
e.g.\n\n width=64 width in screen pixels\n
width=0.5* relative width of 0.5\n\n The span attribute causes the
attributes of one\n col element to apply to more than one
column.\n--><!--\n Use thead to duplicate headers when breaking
table\n across page boundaries, or for static headers when\n tbody
sections are rendered in scrolling panel.\n\n Use tfoot to duplicate
footers when breaking table\n across page boundaries, or for static
footers when\n tbody sections are rendered in scrolling panel.\n\n
Use multiple tbody sections when rules are needed\n between groups of
table rows.\n--><!-- Scope is simpler than headers attribute for common
tables --><!-- th is for headers, td for data and for cells acting as
both -->]>\r\n<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en">\r\n\t<head>\r\n\t\t<title>Test of lazy DOM
builder</title>\r\n\t</head>\r\n\t<body>\r\n\t\t<p>Some text here is
ok</p>\r\n\t</body>\r\n</html>
Cheers!
-Mike
Re: Identity transformation producing invalid XML
Posted by Bradley Wagner <br...@hannonhill.com>.
Mike,
When I tried this, the error that I got seemed to indicate that the
entities output in the result of the transformation should have had a
space between % sign and the word HTML.
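
To illustrate the point about the missing space, here is a small sketch that feeds two tiny documents to the JAXP parser: one with a space after the % in a parameter entity declaration and one without. The class name, method name and the entity "x" are made up for the example; only the standard javax.xml.parsers and org.xml.sax APIs are assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class EntityDeclCheck {

    /** Returns true if the parser accepts the document as well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        // Space between '%' and the entity name, as the XML grammar requires.
        System.out.println(isWellFormed("<!DOCTYPE p [<!ENTITY % x \"y\">]><p/>"));
        // No space after '%', mirroring the "%HTMLlat1" form in the output.
        System.out.println(isWellFormed("<!DOCTYPE p [<!ENTITY %x \"y\">]><p/>"));
    }
}
```

The wording of the parser's error message varies between parser versions, so the sketch only reports whether the document was accepted.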
I too am curious about this one.
Bradley
Re: Identity transformation producing invalid XML
Posted by Mike Strauch <mi...@hannonhill.com>.
Hey Henry,
Thanks for the reply. I went back and tried this same transformation
with Xalan 2.6 and got similar results (I hadn't realized that we were
previously doing our own filtering that removed all of the DTD
content), but there are some differences between the two outputs. I am
curious as to why this would have changed.
Here are a few comparisons between 2.6 and 2.7.1:
2.6 output:
<?xml version="1.0" encoding="UTF-8"?><!--================== Imported
Names ====================================--> ...
(Notice that a comment starts immediately after the XML declaration; the
doctype declaration actually comes after all of the DTD content, right
before the HTML content.)
2.7.1 output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY %HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" > ...
<!--================== Imported Names
====================================--> ...
(Here the doctype comes after the XML declaration and has entity
declarations and comments inside the doctype declaration.)
2.6 output:
Contains no entity declarations
2.7.1 output:
Contains:
<!ENTITY %HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" >
<!ENTITY %HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" >
<!ENTITY %HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" >
Those are the major differences. Again, I'm just curious as to why this
has changed.
Thanks,
Mike
Henry Zongaro wrote:
>
> Hi, Mike.
>
> Mike Strauch <mi...@hannonhill.com> wrote on 2008-04-16 03:43:23 PM:
> > I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the
> > following issues:
> >
> > I'm using a TransformerIdentityImpl to transform the html below and the
> > result includes a lot of information that I believe is coming from the
> > dtd associated with the doctype, and I'm not sure why it is being
> > included. This alone is not my only concern. When I attempt to
> > validate the result as xml I receive the following error:
> >
> > "White space is required after "<!ENTITY" in the entity declaration."
>
> My understanding is that the description of the identity Transformer
> that's created by calling the zero-argument newTransformer() method is
> intentionally vague in order to allow an implementation to preserve
> more of the source document than would be possible with an XSLT
> identity stylesheet. In particular, Xalan-J attempts to preserve as
> much information about the DTD as it can.
>
> There are no specific output settings to suppress that information.
> To suppress the DTD you would either have to create an identity
> stylesheet and use that for the transformation or filter out the
> DTD somewhere. Are you using a SAXSource?
>
> It looks like a bug that there is a space missing between the % and
> the name of the entity in the entity declaration. Could I ask you to
> open a bug report in Jira?
>
> Thanks,
>
> Henry
> ------------------------------------------------------------------
> Henry Zongaro
> XML Transformation & Query Development
> IBM Toronto Lab T/L 313-6044; Phone +1 905 413-6044
> mailto:zongaro@ca.ibm.com
Re: Identity transformation producing invalid XML
Posted by Henry Zongaro <zo...@ca.ibm.com>.
Hi, Mike.
Mike Strauch <mi...@hannonhill.com> wrote on 2008-04-16 03:43:23 PM:
> I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the
> following issues:
>
> I'm using a TransformerIdentityImpl to transform the html below and the
> result includes a lot of information that I believe is coming from the
> dtd associated with the doctype, and I'm not sure why it is being
> included. This alone is not my only concern. When I attempt to
> validate the result as xml I receive the following error:
>
> "White space is required after "<!ENTITY" in the entity declaration."
My understanding is that the description of the identity Transformer
that's created by calling the zero-argument newTransformer() method is
intentionally vague in order to allow an implementation to preserve more
of the source document than would be possible with an XSLT identity
stylesheet. In particular, Xalan-J attempts to preserve as much
information about the DTD as it can.
There are no specific output settings to suppress that information. To
suppress the DTD you would either have to create an identity stylesheet
and use that for the transformation or filter out the DTD somewhere.
Are you using a SAXSource?
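
The identity-stylesheet option mentioned above could be sketched as follows. This is only an illustration, assuming the standard JAXP API and the usual XSLT 1.0 identity template; the class and method names are made up for the example. Because DTD declarations are not part of the tree an XSLT stylesheet operates on, the result should carry no doctype or internal subset.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, for illustration only.
final class IdentityStylesheetSketch {

    // The usual XSLT 1.0 identity template: copy every node and attribute.
    private static final String IDENTITY_XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Copies the document through an explicit identity stylesheet. */
    static String copyWithoutDtd(String xml) {
        try {
            // Passing a stylesheet Source yields a regular XSLT transformer
            // rather than the zero-argument identity transformer.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(IDENTITY_XSL)));

            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Stylesheet transformation failed", e);
        }
    }
}
```

If a plain doctype line is still wanted in the result, the OutputKeys.DOCTYPE_PUBLIC and OutputKeys.DOCTYPE_SYSTEM output properties can be set on the transformer to have the serializer emit one without an internal subset.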
It looks like a bug that there is a space missing between the % and the
name of the entity in the entity declaration. Could I ask you to open a
bug report in Jira?
Thanks,
Henry
------------------------------------------------------------------
Henry Zongaro
XML Transformation & Query Development
IBM Toronto Lab T/L 313-6044; Phone +1 905 413-6044
mailto:zongaro@ca.ibm.com