
Posted to j-users@xalan.apache.org by Jenny Brown <sk...@gmail.com> on 2008/04/17 03:27:44 UTC

Trouble exporting HTML from a DOM in memory

I have an html DOM tree in memory (after having passed html through
JTidy and NekoHTML for validation/cleanup) and I'm trying to write it
back out as valid html.  I'm using Xerces 2.9.1 and Xalan 2.7.1 with
Sun JDK 1.5.0_14.  I'm running this from the command line, so I have
careful control of the classpath.  The jars in my project are very
minimal but I wouldn't rule out conflicts with the JDK yet (though I'm
not sure how to check that).  The specific examples I'm having trouble
with follow, as well as the code I'm using to do the export.

The main situation I'm having trouble with is empty tags.  For
instance... my input file contains:
<P>This is some <STRONG></STRONG> paragraph text.</P>
<P>This is a textarea.  <TEXTAREA name="foo"></TEXTAREA>  It has text
after it.</P>

It gets into my in-memory dom tree okay.  But then when I try to use a
transformer to output the html, instead I get this which Firefox
chokes on:
<P>This is some <STRONG/> paragraph text.</P>
<P>This is a textarea.  <TEXTAREA name="foo"/> It has text after it.</P>

(Firefox sees <STRONG/> and thinks it means <STRONG> and sees
<TEXTAREA/> and thinks it means <TEXTAREA>  ... which leaves the tags
hanging open and they boldface or otherwise consume the rest of the
page; on other tags such as div it may even make the whole page
un-renderable.)


So here's what I'm doing for export code, and my intention is simply
to produce valid HTML that a browser can render later.
============
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

StringWriter sw = new StringWriter();
try {
	 transformer.transform(new DOMSource(domDocument), new StreamResult(sw));
} catch (TransformerException te) {
	 return(te.toString());
}

============

(Yes, I do really actually want it in a string after that, not an
output stream... this will eventually be a module in the middle of a
handling pipeline)

So, I'm trying to tell it to give me html, but what I get is a
document that contains xml-like empty tags wherever the tag was empty,
which results in browser bombs, and starts with:
<HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">


I'm sure there's something I'm missing here (configuration? other
setup?), but I'm not sure what.  Thanks for your help.


Jenny Brown

Re: Trouble exporting HTML from a DOM in memory

Posted by ke...@us.ibm.com.

I'm fairly sure that if you use a "real" identity stylesheet to create 
your transformer, with xsl:output set to HTML mode, you'll get the results 
you intended. You might want to try that approach, both as a probable 
workaround and to help isolate the problem.

(Many moons ago, there were some Issues with the built-in default 
transformation. It wasn't going through exactly the same code paths, and 
it had a few bugs that the real XSLT processor didn't. I thought we 
tracked those down, but I wouldn't be shocked if there's still a lurking 
horror or two.)
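
For reference, a minimal sketch of the "real" identity stylesheet approach,
assuming plain JAXP. The class name IdentityHtmlExport and the method name
toHtml are made up for illustration and are not part of Xalan. The stylesheet
copies every node unchanged and asks for HTML output through xsl:output, so
the serialization goes through the regular XSLT code path instead of the
built-in default transformation. Note that xsl:copy preserves namespaces, so
this only isolates the default-transformer question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Node;

public class IdentityHtmlExport {

    // Identity transformation: copy every attribute and node as-is,
    // with the output method declared in the stylesheet itself.
    private static final String IDENTITY_XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' encoding='UTF-8'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Serializes the given DOM node to a String using the stylesheet above. */
    public static String toHtml(Node dom) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(IDENTITY_XSL)));
        StringWriter sw = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(sw));
        return sw.toString();
    }
}
```

Calling IdentityHtmlExport.toHtml(domDocument) would take the place of the
newTransformer()/setOutputProperty() sequence in the original code.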

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
     (http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: Trouble exporting HTML from a DOM in memory

Posted by Brian Minchau <mi...@ca.ibm.com>.

Hi Jenny.

Things are looking a little funny here. First you do this:
transformer.setOutputProperty(OutputKeys.METHOD, "html");

but you also do this:
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");


If you are really getting the html output method then there is no reason to
omit the xml declaration.

If an element has a name that is recognized as an html element ( HTML,
HEAD, BR, STRONG, TEXTAREA ...) and that name is in no namespace, then the
element is treated as an HTML element.
Looking at the serializer code in Xalan-J 2.7.1 the minimized form is
never emitted for html elements. However one can have XML elements within
HTML.

An element is treated as an XML element if it is in a namespace, or if its
name is not one of the recognized html elements.

In your case it would appear that either the output method is XML, and all
elements are treated as XML, hence the minimized form, or even though these
elements have the right names, they are not treated as HTML because they
are in a default namespace.

 I'm leaning towards the possibility that the effective output method is
xml.  What does happen when you don't ask for the xml header to be omitted?
Does it come out?
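
A small diagnostic along these lines may help tell the two possibilities
apart. This is only a sketch using standard JAXP and DOM calls; the class name
ExportDiagnostics and the method name describe are made up for illustration.
It reports which Transformer implementation was actually loaded (useful for
spotting a conflict between the Xalan jars and the JDK's bundled processor),
the output method currently in effect, and the namespace of the root element.

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ExportDiagnostics {

    /**
     * Builds a short report about the transformer and the document.
     * A non-null root namespace means the html output method will not
     * treat the elements as HTML elements.
     */
    public static String describe(Transformer transformer, Document doc) {
        Element root = doc.getDocumentElement();
        return "transformer class: " + transformer.getClass().getName() + "\n"
             + "output method: "
             + transformer.getOutputProperty(OutputKeys.METHOD) + "\n"
             + "root element: " + root.getNodeName() + "\n"
             + "root namespace: " + root.getNamespaceURI();
    }
}
```

Printing ExportDiagnostics.describe(transformer, domDocument) just before the
call to transform() shows the state the serializer will see.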

I think you've created an identity transformation to do your serialization
and it is too late to set the output method.   Try setting the output
method on the object that you get from TransformerFactory.newInstance(),
before you call newTransformer() on it.




- Brian

Re: Trouble exporting HTML from a DOM in memory

Posted by Henry Zongaro <zo...@ca.ibm.com>.

Hi, Jenny.

"Jenny Brown" <sk...@gmail.com> wrote on 2008-04-16 09:27:44 PM:
> The main situation I'm having trouble with is empty tags.  For
> instance... my input file contains:
> <P>This is some <STRONG></STRONG> paragraph text.</P>
> <P>This is a textarea.  <TEXTAREA name="foo"></TEXTAREA>  It has text
> after it.</P>
> 
> It gets into my in-memory dom tree okay.  But then when I try to use a
> transformer to output the html, instead I get this which Firefox
> chokes on:
> <P>This is some <STRONG/> paragraph text.</P>
> <P>This is a textarea.  <TEXTAREA name="foo"/> It has text after it.</P>
>
> [Snip]
>
> Transformer transformer = TransformerFactory.newInstance().newTransformer();
> transformer.setOutputProperty(OutputKeys.METHOD, "html");
> transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html");
> transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
> transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
>
> [Snip]
>
> So, I'm trying to tell it to give me html, but what I get is a
> document that contains xml-like empty tags wherever the tag was empty,
> which results in browser bombs, and starts with:
> <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">

I think this is the key.  You have specified that you want to use the html 
output method, but your output is really xhtml.  Because your output is in 
an XML namespace, the serializer is required to serialize the output as 
XML, despite the fact that you've used the html output method.  However, 
XHTML has to adhere to certain lexical conventions in order to be 
correctly displayed in a browser that ordinary XML does not have to adhere 
to.

XSLT 1.0 does not define an xhtml output method, but Xalan-J does allow 
you to give it a clue that what you're serializing is really XHTML.  If 
you add the following output property, the serializer will emit empty tags 
using a space before the trailing /> - thus, <STRONG />

transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD XHTML 1.0 Transitional//EN");

That will probably help with a tag like <br> which is always supposed to 
be empty - it will be serialized as <br /> - but probably not with STRONG 
and TEXTAREA which happen to have no content in your DOM tree, but 
ordinarily would have content.  They really should be serialized as 
<STRONG></STRONG> rather than <STRONG />.  This issue has previously been 
reported as Jira issue XALANJ-1906.[1]

In the meanwhile, you probably have a couple of options for working around 
this issue:  one would be to recreate the DOM tree using elements that are in 
no namespace rather than being in the XHTML namespace - then the html 
output method would work properly; another would be to search the DOM tree 
looking for elements that ordinarily have content but are actually empty, 
and give them a single whitespace node child or remove them from the tree 
entirely.  You could also write XSLT stylesheets to implement any of those 
work-arounds; let us know if you'd like an example.
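
A rough sketch of the first work-around (rebuilding the tree with elements in
no namespace), using only standard DOM calls. The class name NamespaceStripper
and the method name stripNamespaces are made up for illustration. It assumes
that only the document element and its descendants matter: a doctype or
top-level comments are not copied. Namespace declaration attributes (xmlns,
xmlns:*) are dropped and all other attributes are kept under their existing
names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class NamespaceStripper {

    /** Returns a new Document whose elements are all in no namespace. */
    public static Document stripNamespaces(Document source)
            throws ParserConfigurationException {
        Document target = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = source.getDocumentElement();
        if (root != null) {
            target.appendChild(copy(root, target));
        }
        return target;
    }

    private static Node copy(Node node, Document target) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            // Text, comments, etc. carry no namespace; import them as-is.
            return target.importNode(node, true);
        }
        // getLocalName() is null for nodes created with DOM Level 1 methods.
        String name = node.getLocalName() != null
            ? node.getLocalName() : node.getNodeName();
        Element result = target.createElement(name);

        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            String attrName = attr.getNodeName();
            // Skip namespace declarations such as xmlns="...".
            if (attrName.equals("xmlns") || attrName.startsWith("xmlns:")) {
                continue;
            }
            result.setAttribute(attrName, attr.getValue());
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            result.appendChild(copy(child, target));
        }
        return result;
    }
}
```

The stripped document would then be passed to the transformer in place of the
original, for example new DOMSource(NamespaceStripper.stripNamespaces(domDocument)).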

Thanks,

Henry
[1] http://issues.apache.org/jira/browse/XALANJ-1906
------------------------------------------------------------------
Henry Zongaro
XML Transformation & Query Development
IBM Toronto Lab   T/L 313-6044;  Phone +1 905 413-6044
mailto:zongaro@ca.ibm.com