Posted to j-users@xalan.apache.org by Mike Strauch <mi...@hannonhill.com> on 2008/04/16 21:43:23 UTC

Identity transformation producing invalid XML

Hello,

I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the 
following issues:

I'm using a TransformerIdentityImpl to transform the HTML below, and the 
result includes a lot of information that I believe comes from the DTD 
associated with the doctype.  I'm not sure why it is being included, and 
that is not my only concern: when I attempt to validate the result as 
XML, I receive the following error:

"White space is required after "<!ENTITY" in the entity declaration."

I am setting the following output properties on the transformer itself:

omit-xml-declaration: no
standalone: no
method: xml

I have experimented with these output properties but have not been able 
to get a result that omits the doctype information.  Is there something 
I am missing?

Original:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Test of lazy DOM builder</title>
    </head>
    <body>
        <p>Some text here is ok</p>
    </body>
</html>

Transformed:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html 
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [\r\n<!ENTITY 
%HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" >\r\n<!ENTITY 
%HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" 
 >\r\n<!ENTITY %HTMLspecial PUBLIC "-//W3C//ENTITIES Special for 
XHTML//EN" >\r\n<!--================== Imported Names 
====================================--><!-- media type, as per [RFC2045] 
--><!-- comma-separated list of media types, as per [RFC2045] --><!-- a 
character encoding, as per [RFC2045] --><!-- a space separated list of 
character encodings, as per [RFC2045] --><!-- a language code, as per 
[RFC3066] --><!-- a single character, as per section 2.2 of [XML] 
--><!-- one or more digits --><!-- space-separated list of link types 
--><!-- single or comma-separated list of media descriptors --><!-- a 
Uniform Resource Identifier, see [RFC2396] --><!-- a space separated 
list of Uniform Resource Identifiers --><!-- date and time information. 
ISO date format --><!-- script expression --><!-- style sheet data 
--><!-- used for titles etc. --><!-- render in this frame --><!-- nn for 
pixels or nn% for percentage length --><!-- pixel, percentage, or 
relative --><!-- integer representing length in pixels --><!-- these are 
used for image maps --><!-- comma separated list of lengths --><!-- used 
for object, applet, img, input and iframe --><!-- a color using sRGB: 
#RRGGBB as Hex values --><!-- There are also 16 widely known color names 
with their sRGB values:\n\n    Black  = #000000    Green  = #008000\n    
Silver = #C0C0C0    Lime   = #00FF00\n    Gray   = #808080    Olive  = 
#808000\n    White  = #FFFFFF    Yellow = #FFFF00\n    Maroon = 
#800000    Navy   = #000080\n    Red    = #FF0000    Blue   = 
#0000FF\n    Purple = #800080    Teal   = #008080\n    Fuchsia= 
#FF00FF    Aqua   = #00FFFF\n--><!--=================== Generic 
Attributes ===============================--><!-- core attributes common 
to most elements\n  id       document-wide unique id\n  class    space 
separated list of classes\n  style    associated style info\n  title    
advisory title/amplification\n--><!-- internationalization attributes\n  
lang        language code (backwards compatible)\n  xml:lang    language 
code (as per XML 1.0 spec)\n  dir         direction for weak/neutral 
text\n--><!-- attributes for common UI events\n  onclick     a pointer 
button was clicked\n  ondblclick  a pointer button was double clicked\n  
onmousedown a pointer button was pressed down\n  onmouseup   a pointer 
button was released\n  onmousemove a pointer was moved onto the 
element\n  onmouseout  a pointer was moved away from the element\n  
onkeypress  a key was pressed and released\n  onkeydown   a key was 
pressed down\n  onkeyup     a key was released\n--><!-- attributes for 
elements that can get the focus\n  accesskey   accessibility key 
character\n  tabindex    position in tabbing order\n  onfocus     the 
element got the focus\n  onblur      the element lost the focus\n--><!-- 
text alignment for p, div, h1-h6. The default is\n     align="left" for 
ltr headings, "right" for rtl --><!--=================== Text Elements 
====================================--><!-- these can occur at block or 
inline level --><!-- these can only occur at block level --><!-- 
%Inline; covers inline or "text-level" elements 
--><!--================== Block level elements 
==============================--><!-- %Flow; mixes block and inline and 
is used for list items etc. --><!--================== Content models for 
exclusions =====================--><!-- a elements use %Inline; 
excluding a --><!-- pre uses %Inline excluding img, object, applet, big, 
small,\n     font, or basefont --><!-- form uses %Flow; excluding form 
--><!-- button uses %Flow; but excludes a, form, form controls, iframe 
--><!--================ Document Structure 
==================================--><!-- the namespace URI designates 
the document profile --><!--================ Document Head 
=======================================--><!-- content model is 
%head.misc; combined with a single\n     title and an optional base 
element in any order --><!-- The title element is not considered part of 
the flow of text.\n       It should be displayed, for example as the 
page header or\n       window title. Exactly one title is required per 
document.\n    --><!-- document base URI --><!-- generic metainformation 
--><!--\n  Relationship values can be used in principle:\n\n   a) for 
document specific toolbars/menus when used\n      with the link element 
in document head e.g.\n        start, contents, previous, next, index, 
end, help\n   b) to link to a separate style sheet 
(rel="stylesheet")\n   c) to make a link to a script (rel="script")\n   
d) by stylesheets to control how collections of\n      html nodes are 
rendered into printed documents\n   e) to make a link to a printable 
version of this document\n      e.g. a PostScript or PDF version 
(rel="alternate" media="print")\n--><!-- style info, which may include 
CDATA sections --><!-- script statements, which may include CDATA 
sections --><!-- alternate content container for non script-based 
rendering --><!--======================= Frames 
=======================================--><!-- inline subwindow --><!-- 
alternate content container for non frame-based rendering 
--><!--=================== Document Body 
====================================--><!-- generic language/style 
container --><!--=================== Paragraphs 
=======================================--><!--=================== 
Headings =========================================--><!--\n  There are 
six levels of headings from h1 (the most important)\n  to h6 (the least 
important).\n--><!--=================== Lists 
============================================--><!-- Unordered list 
bullet styles --><!-- Unordered list --><!-- Ordered list numbering 
style\n\n    1   arabic numbers      1, 2, 3, ...\n    a   lower 
alpha         a, b, c, ...\n    A   upper alpha         A, B, C, 
...\n    i   lower roman         i, ii, iii, ...\n    I   upper 
roman         I, II, III, ...\n\n    The style is applied to the 
sequence number which by default\n    is reset to 1 for the first list 
item in an ordered list.\n--><!-- Ordered (numbered) list --><!-- single 
column list (DEPRECATED) --><!-- multiple column list (DEPRECATED) 
--><!-- LIStyle is constrained to: "(%ULStyle;|%OLStyle;)" --><!-- list 
item --><!-- definition lists - dt for term, dd for its definition 
--><!--=================== Address 
==========================================--><!-- information on author 
--><!--=================== Horizontal Rule 
==================================--><!--=================== 
Preformatted Text ================================--><!-- content is 
%Inline; excluding \n        
"img|object|applet|big|small|sub|sup|font|basefont" 
--><!--=================== Block-like Quotes 
================================--><!--=================== Text 
alignment ===================================--><!-- center content 
--><!--=================== Inserted/Deleted Text 
============================--><!--\n  ins/del are allowed in block and 
inline content, but its\n  inappropriate to include block content within 
an ins element\n  occurring in inline 
content.\n--><!--================== The Anchor Element 
================================--><!-- content is %Inline; except that 
anchors shouldn't be nested --><!--===================== Inline Elements 
================================--><!-- generic language/style container 
--><!-- I18N BiDi over-ride --><!-- forced line break --><!-- emphasis 
--><!-- strong emphasis --><!-- definitional --><!-- program code 
--><!-- sample --><!-- something user would type --><!-- variable 
--><!-- citation --><!-- abbreviation --><!-- acronym --><!-- inlined 
quote --><!-- subscript --><!-- superscript --><!-- fixed pitch font 
--><!-- italic font --><!-- bold font --><!-- bigger font --><!-- 
smaller font --><!-- underline --><!-- strike-through --><!-- 
strike-through --><!-- base font size --><!-- local change to font 
--><!--==================== Object 
======================================--><!--\n  object is used to embed 
objects as part of HTML pages.\n  param elements should precede other 
content. Parameters\n  can also be expressed as attribute/value pairs on 
the\n  object element itself when brevity is desired.\n--><!--\n  param 
is used to supply a named property value.\n  In XML it would seem 
natural to follow RDF and support an\n  abbreviated syntax where the 
param elements are replaced\n  by attribute value pairs on the object 
start tag.\n--><!--=================== Java applet 
==================================--><!--\n  One of code or object 
attributes must be present.\n  Place param elements before other 
content.\n--><!--=================== Images 
===========================================--><!--\n   To avoid 
accessibility problems for people who aren't\n   able to see the image, 
you should provide a text\n   description using the alt and longdesc 
attributes.\n   In addition, avoid the use of server-side image 
maps.\n--><!-- usemap points to a map element which may be in this 
document\n  or an external document, although the latter is not widely 
supported --><!--================== Client-side image maps 
============================--><!-- These can be placed in the same 
document or grouped in a\n     separate document although this isn't yet 
widely supported --><!--================ Forms 
===============================================--><!-- forms shouldn't 
be nested --><!--\n  Each label must not contain more than ONE field\n  
Label elements shouldn't be nested.\n--><!-- the name attribute is 
required for all but submit & reset --><!-- form control --><!-- option 
selector --><!-- option group --><!-- selectable choice --><!-- 
multi-line text field --><!--\n  The fieldset element is used to group 
form fields.\n  Only one legend element should occur in the content\n  
and if present should only be preceded by whitespace.\n--><!-- fieldset 
label --><!--\n Content is %Flow; excluding a, form, form controls, 
iframe\n--><!-- push button --><!-- single-line text input control 
(DEPRECATED) --><!--======================= Tables 
=======================================--><!-- Derived from IETF HTML 
table standard, see [RFC1942] --><!--\n The border attribute sets the 
thickness of the frame around the\n table. The default units are screen 
pixels.\n\n The frame attribute specifies which parts of the frame 
around\n the table should be rendered. The values are not the same as\n 
CALS to avoid a name clash with the valign attribute.\n--><!--\n The 
rules attribute defines which rules to draw between cells:\n\n If rules 
is absent then assume:\n     "none" if border is absent or border="0" 
otherwise "all"\n--><!-- horizontal placement of table relative to 
document --><!-- horizontal alignment attributes for cell contents\n\n  
char        alignment char, e.g. char=':'\n  charoff     offset for 
alignment char\n--><!-- vertical alignment attributes for cell contents 
--><!--\ncolgroup groups a set of col elements. It allows you to 
group\nseveral semantically related columns together.\n--><!--\n col 
elements define the alignment properties for cells in\n one or more 
columns.\n\n The width attribute specifies the width of the columns, 
e.g.\n\n     width=64        width in screen pixels\n     
width=0.5*      relative width of 0.5\n\n The span attribute causes the 
attributes of one\n col element to apply to more than one 
column.\n--><!--\n    Use thead to duplicate headers when breaking 
table\n    across page boundaries, or for static headers when\n    tbody 
sections are rendered in scrolling panel.\n\n    Use tfoot to duplicate 
footers when breaking table\n    across page boundaries, or for static 
footers when\n    tbody sections are rendered in scrolling panel.\n\n    
Use multiple tbody sections when rules are needed\n    between groups of 
table rows.\n--><!-- Scope is simpler than headers attribute for common 
tables --><!-- th is for headers, td for data and for cells acting as 
both -->]>\r\n<html xmlns="http://www.w3.org/1999/xhtml" 
xml:lang="en">\r\n\t<head>\r\n\t\t<title>Test of lazy DOM 
builder</title>\r\n\t</head>\r\n\t<body>\r\n\t\t<p>Some text here is 
ok</p>\r\n\t</body>\r\n</html>

Cheers!

-Mike

Re: Identity transformation producing invalid XML

Posted by Bradley Wagner <br...@hannonhill.com>.
Mike,

When I tried this, the error I got seemed to indicate that the entity 
declarations in the transformation result should have had a space 
between the % sign and the entity name that follows it (HTMLlat1 etc.).

I too am curious about this one.

Bradley



Re: Identity transformation producing invalid XML

Posted by Mike Strauch <mi...@hannonhill.com>.
Hey Henry,

Thanks for the reply.  I went back and tried this same transformation 
with Xalan 2.6 and got similar results (I hadn't realized that we were 
previously doing our own filtering, which removed all of the DTD 
content), but there are some differences between the two outputs.  I am 
curious as to why this would have changed.
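
Filtering of the kind mentioned above can be sketched with a SAX filter. 
The sketch below is illustrative only and is not the filter referred to 
in this message: the class name DtdDroppingFilter and the method 
copyWithoutDtd are made up for the example, and it assumes the 
transformer subscribes to DTD events through the standard SAX 
lexical-handler and declaration-handler properties.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: hides DTD-related SAX events from the consumer.
class DtdDroppingFilter extends XMLFilterImpl {
    private static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";
    private static final String DECLARATION_HANDLER =
        "http://xml.org/sax/properties/declaration-handler";

    DtdDroppingFilter(XMLReader parent) {
        super(parent);
    }

    // Do not pass the two DTD-related handlers on to the real parser, so the
    // consumer never sees the doctype, DTD comments, or entity declarations.
    // Side effect: comments and CDATA section boundaries in the document
    // body are no longer reported either.
    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (LEXICAL_HANDLER.equals(name) || DECLARATION_HANDLER.equals(name)) {
            return;
        }
        super.setProperty(name, value);
    }

    // Runs the identity Transformer over the filtered SAX stream.
    static String copyWithoutDtd(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filtered =
                new DtdDroppingFilter(factory.newSAXParser().getXMLReader());

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(filtered, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | TransformerException e) {
            throw new IllegalStateException("filtered identity copy failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ENTITY % x 'y'>]><doc><p>ok</p></doc>";
        System.out.println(copyWithoutDtd(xml));
    }
}
```

The trade-off of this approach is that it is coarse: anything reported 
only through the lexical handler is lost, not just the DTD content.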

Here are a few comparisons between 2.6 and 2.7.1

2.6 output:

<?xml version="1.0" encoding="UTF-8"?><!--================== Imported 
Names ====================================--> ... 

(Notice that a comment starts immediately after the XML declaration; the 
doctype declaration actually comes after all of the DTD content, right 
before the HTML content.)

2.7.1 output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY %HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" > ... 
<!--================== Imported Names 
====================================--> ...

(Here the doctype declaration comes right after the XML declaration and 
contains the entity declarations and comments in its internal subset.)

2.6 output:

Contains no entity declarations.

2.7.1 output:

Contains the following entity declarations:

<!ENTITY %HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" >
<!ENTITY %HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" >
<!ENTITY %HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" >

Those are the major differences.  Again, I'm just curious as to why this 
has changed.

Thanks,

Mike

Henry Zongaro wrote:
>
> Hi, Mike.
>
> Mike Strauch <mi...@hannonhill.com> wrote on 2008-04-16 
> 03:43:23 PM:
> > I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the
> > following issues:
> >
> > I'm using a TransformerIdentityImpl to transform the html below and the
> > result includes a lot of information that I believe is coming from the
> > dtd associated with the doctype, and I'm not sure why it is being
> > included.  This alone is not my only concern.  When I attempt to
> > validate the result as xml I receive the following error:
> >
> > "White space is required after "<!ENTITY" in the entity declaration."
>
> My understanding is that the description of the identity Transformer 
> that's created by calling the zero-argument newTransformer() method is 
> intentionally vague in order to allow an implementation to preserve 
> more of the source document than would be possible with an XSLT 
> identity stylesheet.  In particular, Xalan-J attempts to preserve as 
> much information about the DTD as it can.
>
> There are no specific output settings to suppress that information. 
> To suppress the DTD, you would have to either create an identity 
> stylesheet and use that for the transformation or filter out the 
> DTD somewhere.  Are you using a SAXSource?
>
> It looks like a bug that there is a space missing between the % and 
> the name of the entity in the entity declaration.  Could I ask you to 
> open a bug report in Jira?
>
> Thanks,
>
> Henry
> ------------------------------------------------------------------
> Henry Zongaro
> XML Transformation & Query Development
> IBM Toronto Lab   T/L 313-6044;  Phone +1 905 413-6044
> mailto:zongaro@ca.ibm.com


Re: Identity transformation producing invalid XML

Posted by Henry Zongaro <zo...@ca.ibm.com>.
Hi, Mike.

Mike Strauch <mi...@hannonhill.com> wrote on 2008-04-16 03:43:23 
PM:
> I've recently upgraded Xalan from 2.6 to 2.7.1 and have run into the 
> following issues:
> 
> I'm using a TransformerIdentityImpl to transform the html below and the 
> result includes a lot of information that I believe is coming from the 
> dtd associated with the doctype, and I'm not sure why it is being 
> included.  This alone is not my only concern.  When I attempt to 
> validate the result as xml I receive the following error:
> 
> "White space is required after "<!ENTITY" in the entity declaration."

My understanding is that the description of the identity Transformer 
that's created by calling the zero-argument newTransformer() method is 
intentionally vague in order to allow an implementation to preserve more 
of the source document than would be possible with an XSLT identity 
stylesheet.  In particular, Xalan-J attempts to preserve as much 
information about the DTD as it can.
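
For reference, the setup under discussion can be sketched as follows. 
This is a minimal illustration using the standard JAXP API, not the code 
from the original post: the class name IdentityCopy and the method copy 
are made up, and which implementation TransformerFactory.newInstance() 
returns (Xalan-J or another processor) depends on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: copies a document through the identity Transformer.
class IdentityCopy {

    static String copy(String xml) {
        try {
            // The zero-argument newTransformer() returns the identity Transformer.
            Transformer identity = TransformerFactory.newInstance().newTransformer();

            // The three output properties listed in the original post.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            identity.setOutputProperty(OutputKeys.STANDALONE, "no");
            identity.setOutputProperty(OutputKeys.METHOD, "xml");

            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)),
                               new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("identity transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy("<doc><p>Some text here is ok</p></doc>"));
    }
}
```

How much of a DTD survives such a copy is implementation-specific, which 
is the point made above.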

There are no specific output settings to suppress that information.  To 
suppress the DTD, you would have to either create an identity stylesheet 
and use that for the transformation or filter out the DTD somewhere.  
Are you using a SAXSource?
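
The identity-stylesheet option can be sketched like this. The stylesheet 
is the usual XSLT 1.0 identity template; the class name 
IdentityStylesheetCopy and the method copy are made up for the example. 
Because the XSLT data model does not carry the DTD, the result has no 
doctype declaration at all.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: copies a document through an explicit identity stylesheet.
class IdentityStylesheetCopy {

    // Identity template: copy every attribute and node, recursively.
    static final String IDENTITY_XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String copy(String xml) {
        try {
            // Compile the stylesheet instead of asking for the identity Transformer.
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(IDENTITY_XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Internal subset only, so nothing is fetched from the network.
        String xml = "<!DOCTYPE doc [<!ENTITY % x 'y'>]><doc><p>ok</p></doc>";
        System.out.println(copy(xml));
    }
}
```

If the public and system identifiers should still appear in the result, 
they can be set explicitly with OutputKeys.DOCTYPE_PUBLIC and 
OutputKeys.DOCTYPE_SYSTEM (or the doctype-public and doctype-system 
attributes of xsl:output).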

It looks like a bug that there is a space missing between the % and the 
name of the entity in the entity declaration.  Could I ask you to open a 
bug report in Jira?
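
The missing space matters because the XML 1.0 grammar for a parameter 
entity declaration is '<!ENTITY' S '%' S Name S PEDef S? '>', so white 
space is required on both sides of the % sign. A small check with the 
JAXP SAX parser illustrates the difference; the class name PeDeclCheck 
and the two sample documents are made up for the example and use an 
internal entity value so that nothing is fetched from the network.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a parser accepts a document.
class PeDeclCheck {

    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the declarations in the 2.7.1 output: no space after '%'.
        String withoutSpace = "<!DOCTYPE doc [<!ENTITY %x 'y'>]><doc/>";
        // Corrected form: white space between '%' and the entity name.
        String withSpace = "<!DOCTYPE doc [<!ENTITY % x 'y'>]><doc/>";

        System.out.println("without space accepted: " + isWellFormed(withoutSpace));
        System.out.println("with space accepted:    " + isWellFormed(withSpace));
    }
}
```

A conforming parser should reject the first document and accept the 
second.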

Thanks,

Henry
------------------------------------------------------------------
Henry Zongaro
XML Transformation & Query Development
IBM Toronto Lab   T/L 313-6044;  Phone +1 905 413-6044
mailto:zongaro@ca.ibm.com