Posted to users@cocoon.apache.org by Ola Berg <ol...@arkitema.se> on 2002/09/05 10:22:41 UTC

Handling lousy HTML

At work I have to handle really badly written HTML (they used some really bad HTML generator):

<html>
<body>
<h1>Hello, world!</H1>

Hi there.
<p>
This is plain wrong.
<p>
But it works in certain browsers
</body>
</html>

I thought that by using the HTMLGenerator, Tidy would take care of this. In my sitemap I have

<map:generate src="hello.html" type="html"/>
<map:serialize type="xhtml"/>

But the server complains that the source "hello.html" is lousy HTML (containing unbalanced tags).
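
To make "unbalanced tags" concrete, here is a minimal sketch that feeds markup to the JDK's strict XML parser and reports the first fatal error. It only illustrates why a well-formedness check rejects hello.html (XML tag names are case-sensitive, so <h1> is not closed by </H1>, and the <p> tags are never closed). It is not a claim about where inside Cocoon the complaint is raised; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Cocoon or JTidy.
public class WellFormedCheck {

    /** Returns the parser's first fatal error message, or null if the markup is well-formed XML. */
    public static String firstError(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String hello = "<html>\n<body>\n<h1>Hello, world!</H1>\n\nHi there.\n<p>\n"
                + "This is plain wrong.\n<p>\nBut it works in certain browsers\n</body>\n</html>\n";
        // Prints the parser's complaint; the exact wording depends on the XML parser in use.
        System.out.println(firstError(hello));
    }
}
```

A null result means the input would survive a strict parse; anything else needs cleaning before it can enter an XML pipeline.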

1) Shouldn't tidy handle this?

2) Or isn't tidy involved when I declare my sitemap as above?

If 1) is no, I plan to hack the functionality (going to the dev list first).

If 2) is no, I'd like to know how to configure it to handle this.

I use cocoon-2.0.2-bin.

TIA

/O

--------------------
ola.berg@arkitema.se
0733 - 99 99 17

---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


Re: Handling lousy HTML

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Friday 06 September 2002 16:30, Ola Berg wrote:
>. . .
> is it safe to believe that HTMLGenerator utilizes JTidy and 
> that JTidy fails, 

As Nicola told you, HTMLGenerator *does* use JTidy, as is clearly visible 
from the source code.

However, AFAIK JTidy offers many more options than what HTMLGenerator uses, 
and HTMLGenerator doesn't allow them to be set from the Cocoon configuration, 
they are hardcoded.

This is something that could be improved in Cocoon, allowing these options to 
be set so that a wider range of HTML documents can be used as input.

Did you test the latest JTidy on your input directly, outside of Cocoon?
If JTidy is unable to process it, HTMLGenerator won't either, and in that 
case it might be better to work with the JTidy team on improving JTidy 
instead of writing a new thing.

Hope this helps.

-- 
 Bertrand Delacrétaz (codeconsult.ch, jfor.org)

 buzzwords: XML, java, XSLT, cocoon, mentoring/teaching/coding.
 disclaimer: eternity is very long. mostly towards the end. get ready.


Re: Handling lousy HTML

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Ola Berg wrote:
> From: "Nicola Ken Barozzi" <ni...@apache.org>
> 
>>HTMLGenerator uses JTidy directly, without making assumptions itself.
>>If you can use JTidy to work for you, it should work - or can be easily 
>>made to work - with HTMLGenerator too.
> 
> 
> What do you mean? I can use JTidy on my system, whether Cocoon utilizes or not was my question to you, dear community ;-)

I meant if you can make it work from the commandline to generate the 
result you want, then also Cocoon can do it.

> Therefore I provided both the sitemap snippet as well as the test bhtml-document.
> I use the binary distribution of Cocoon 2.0.2 (where documentation says that this feature is enabled by default). And if it is not enabled by default, I haven't been able to find out how to enable it. 
> 
> Question restated: given my configuration and the bhtml document that fails, is it safe to believe that HTMLGenerator utilizes JTidy and that JTidy fails, or is it safe to believe that HTMLGenerator fails because it fails to utilize JTidy? 

I don't know, that's why I asked you that question.
Use JTidy outside of Cocoon to see if it works.
If it does, tell us how you did it, and we will patch the Cocoon 
HTMLGenerator to play nice.

> And if the latter is true, how could I tweak it so that JTidy will be utilized by HTMLGenerator? 

This is what HTMLGenerator does:

             // Setup an instance of Tidy.
             Tidy tidy = new Tidy();
             tidy.setXmlOut(true);
             tidy.setXHTML(true);
             //Set Jtidy warnings on-off
             tidy.setShowWarnings(getLogger().isWarnEnabled());
             //Set Jtidy final result summary on-off
             tidy.setQuiet(!getLogger().isInfoEnabled());
             //Set Jtidy infos to a String (will be logged) instead of System.out
             StringWriter stringWriter = new StringWriter();
             PrintWriter errorWriter = new PrintWriter(stringWriter);
             tidy.setErrout(errorWriter);

             // Extract the document using JTidy and stream it.
             org.w3c.dom.Document doc = tidy.parseDOM(
                 new BufferedInputStream(this.inputSource.getInputStream()), null);


If you know how to make JTidy output as you need, tell us and we will 
patch the HTMLGenerator.

> If the first is true ("HTMLGenerator can't handle the bhtml-snippet no matter what") I really need to investigate another solution, such as:
> 
>>Look here, maybe it's the right time to ditch tidy entirely
>>
>>http://www.apache.org/~andyc/neko/doc/html/index.html
> 
> 
> ...sounds promising. I'll try to download and investigate. Hopefully I can provide a CleaningHtmlGenerator soon, if it is needed.

Cool :-)

>>>BTW: the example I provided is actually cleaner than much of the code I need Cocoon to deal with.
>>
>>:-O
> 
> 
> I could provide a list of testsnippets that the tidying thing should handle, fx:
> ---
> <h1>Hello <p>How do you do 
> <table border="2 >thing1<td>thing2</table>
> Wondering<p>foo <b>bar <i>baz</b> garply</i>"
> --- should become something like ---
> <html>
> <head>
> </head>
> <body>
> <h1>Hello</h1>
> <p>How do you do
> </p>
> <table border="2">
> <tr><td>thing</td><td>thing2</td></tr>
> </table>
> <p>Wondering
> </p>
> <p>foo <b>bar <i>baz</i></b> <i>garply</i>
> </p>
> </body>
> </html>
> ---


I tried it in the C version of Tidy, this is what I got:

<h1>Hello
<p>How do you do
<table border="2 &gt;thing1&lt;td&gt;thing2&lt;/table&gt; 
Wondering&lt;p&gt;foo &lt;b&gt;bar &lt;i&gt;baz&lt;/b&gt; garply&lt;/i&gt;">
</table>
</p>
</h1>

Maybe changing the rules..
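
A likely reading of that output is that the unterminated quote in border="2 makes everything up to the next double quote part of the attribute value. The sketch below only demonstrates that scanning rule on the test snippet using plain string searches; it is an illustration with made-up names, not Tidy's actual tokenizer.

```java
// Hypothetical illustration of quote scanning; not code from Tidy or JTidy.
public class QuoteScan {

    /** Returns the text between the first double quote and the next one, or null if there is no such pair. */
    public static String firstQuotedValue(String markup) {
        int start = markup.indexOf('"');
        if (start < 0) {
            return null;
        }
        int end = markup.indexOf('"', start + 1);
        if (end < 0) {
            return null;
        }
        return markup.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String snippet = "<h1>Hello <p>How do you do \n"
                + "<table border=\"2 >thing1<td>thing2</table>\n"
                + "Wondering<p>foo <b>bar <i>baz</b> garply</i>\"";
        // The "value" of border runs all the way to the stray quote after </i>.
        System.out.println(firstQuotedValue(snippet));
    }
}
```

The swallowed text is the same text that shows up entity-escaped inside the border attribute in the Tidy output above, so repairing this kind of input would need a rule for ending an attribute value early, for example at the first ">".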

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------




Re: Handling lousy HTML

Posted by Ola Berg <ol...@ports.se>.
From: "Nicola Ken Barozzi" <ni...@apache.org>
> HTMLGenerator uses JTidy directly, without making assumptions itself.
> If you can use JTidy to work for you, it should work - or can be easily 
> made to work - with HTMLGenerator too.

What do you mean? I can use JTidy on my system; whether Cocoon utilizes it or not was my question to you, dear community ;-)

Therefore I provided both the sitemap snippet as well as the test bhtml-document.

I use the binary distribution of Cocoon 2.0.2 (where documentation says that this feature is enabled by default). And if it is not enabled by default, I haven't been able to find out how to enable it. 

Question restated: given my configuration and the bhtml document that fails, is it safe to believe that HTMLGenerator utilizes JTidy and that JTidy fails, or is it safe to believe that HTMLGenerator fails because it fails to utilize JTidy? And if the latter is true, how could I tweak it so that JTidy will be utilized by HTMLGenerator? If the first is true ("HTMLGenerator can't handle the bhtml-snippet no matter what") I really need to investigate another solution, such as:

> Look here, maybe it's the right time to ditch tidy entirely
> 
> http://www.apache.org/~andyc/neko/doc/html/index.html

...sounds promising. I'll try to download and investigate. Hopefully I can provide a CleaningHtmlGenerator soon, if it is needed.

> > BTW: the example I provided is actually cleaner than much of the code I need Cocoon to deal with.
> :-O

I could provide a list of test snippets that the tidying thing should handle, e.g.:
---
<h1>Hello <p>How do you do 
<table border="2 >thing1<td>thing2</table>
Wondering<p>foo <b>bar <i>baz</b> garply</i>"
--- should become something like ---
<html>
<head>
</head>
<body>
<h1>Hello</h1>
<p>How do you do
</p>
<table border="2">
<tr><td>thing</td><td>thing2</td></tr>
</table>
<p>Wondering
</p>
<p>foo <b>bar <i>baz</i></b> <i>garply</i>
</p>
</body>
</html>
---
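
One piece of that transformation, repairing overlapping inline tags such as <b>bar <i>baz</b> garply</i>, can be sketched with a stack of open elements. This is a minimal sketch under stated assumptions: the class name is hypothetical, it is not the algorithm used by JTidy or NekoHTML, and it does not attempt implied end tags (<p>), table structure, void elements like <br>, or broken attribute quotes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Cocoon, JTidy or NekoHTML.
public class NestingRepair {

    // A tag (group 1: "/" for end tags, group 2: name, group 3: rest), a run of text, or a lone "<".
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>|[^<]+|<");

    public static String repair(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // innermost open element is on top
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String name = m.group(2);
            if (name == null) {                    // text, or a stray "<"
                out.append(m.group().equals("<") ? "&lt;" : m.group());
                continue;
            }
            name = name.toLowerCase();
            if (m.group(1).isEmpty()) {            // start tag: normalise the name, keep attributes
                out.append('<').append(name).append(m.group(3)).append('>');
                if (!m.group(3).endsWith("/")) {   // self-closing tags are not pushed
                    open.push(name);
                }
            } else if (open.contains(name)) {      // end tag for an element that is open
                List<String> reopen = new ArrayList<>();
                while (!open.peek().equals(name)) {
                    String inner = open.pop();     // close elements opened after it
                    out.append("</").append(inner).append('>');
                    reopen.add(0, inner);
                }
                open.pop();
                out.append("</").append(name).append('>');
                for (String inner : reopen) {      // reopen them (attributes are lost here)
                    open.push(inner);
                    out.append('<').append(inner).append('>');
                }
            }
            // An end tag with no matching open element is dropped.
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: foo <b>bar <i>baz</i></b><i> garply</i>
        System.out.println(repair("foo <b>bar <i>baz</b> garply</i>"));
    }
}
```

The result differs from the hand-written target only in where the space before "garply" lands, which fits the "something like" wording above.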




Re: Handling lousy HTML

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Ola Berg wrote:
> From: "John Moylan" <jo...@rte.ie>
> 
>>You probably need to preprocess your HTML with tidy before you introduce 
>>it to Cocoon.
> 
> 
> Well, according to the sitemap in the cocoon dist (2.0.2), jtidy is involved in the HTML generator.

Yes, correct.

> Yes, preprocessing is a necessity. But I need it preprocessed live and direct in the pipeline by Cocoon, as the bad HTML is generated by legacy scripts that no one dares to touch, just wrap using Cocoon.
> 
> Either way: a HeavyDutyMrProperHtmlGenerator that fixes this using some heavy tidy-stuff should be useful. I understand if the normal HTMLGenerator don't want to waste cycles on handling "HTML" that never should have been written anyway, but if you _know_ you have to deal with pages generated by FrontPage0.6 or perl scripts done by interns in the summer of '96, I think the option should be available. 
> 
> Does such a beast exist somewhere?

HTMLGenerator uses JTidy directly, without making assumptions itself.
If you can use JTidy to work for you, it should work - or can be easily 
made to work - with HTMLGenerator too.

> If not, I intend to write one, as the problem at our company needs to be solved about this yesterday :-)

Look here, maybe it's the right time to ditch tidy entirely

http://www.apache.org/~andyc/neko/doc/html/index.html

> BTW: the example I provided is actually cleaner than much of the code I need Cocoon to deal with.

:-O

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------




Re: Handling lousy HTML

Posted by Ola Berg <ol...@ports.se>.
From: "John Moylan" <jo...@rte.ie>

> You probably need to preprocess your HTML with tidy before you introduce 
> it to Cocoon.

Well, according to the sitemap in the cocoon dist (2.0.2), jtidy is involved in the HTML generator.

Yes, preprocessing is a necessity. But I need it preprocessed live and direct in the pipeline by Cocoon, as the bad HTML is generated by legacy scripts that no one dares to touch, just wrap using Cocoon.

Either way: a HeavyDutyMrProperHtmlGenerator that fixes this using some heavy tidy-stuff should be useful. I understand if the normal HTMLGenerator doesn't want to waste cycles on handling "HTML" that never should have been written anyway, but if you _know_ you have to deal with pages generated by FrontPage0.6 or perl scripts done by interns in the summer of '96, I think the option should be available.

Does such a beast exist somewhere?

If not, I intend to write one, as the problem at our company needs to be solved about this yesterday :-)

BTW: the example I provided is actually cleaner than much of the code I need Cocoon to deal with.

> ><html>
> ><body>
> ><h1>Hello, world!</H1>
> >
> >Hi there.
> ><p>
> >This is plain wrong.
> ><p>
> >But it works in certain browsers
> ></body>
> ></html>






Re: Handling lousy HTML

Posted by John Moylan <jo...@rte.ie>.
You probably need to preprocess your HTML with tidy before you introduce 
it to Cocoon.

John

Ola Berg wrote:

>At work I have to handle really bad written HTML (they used some really bad HTML generator):
>
><html>
><body>
><h1>Hello, world!</H1>
>
>Hi there.
><p>
>This is plain wrong.
><p>
>But it works in certain browsers
></body>
></html>
>
>I thought by using the HTMLGenerator, the Tidy-thing should take care of this. In my site map I have
>
><map:generate src="hello.html" type="html"/>
><map:serialize type="xhtml"/>
>
>But the server complains about the source "hello.html" being lousy html (containing unbalanced tags). 
>
>1) Shouldn't tidy handle this? 
>
>2)Or isn't tidy involved when I declare my sitemap as above?
>
>If 1) is no, I plan to hack the functionality (going to the dev list first).
>
>If 2) is no, I'd like to know how to configure it to handle.
>
>I use cocoon-2.0.2-bin.
>
>TIA
>
>/O
>
>--------------------
>ola.berg@arkitema.se
>0733 - 99 99 17
>
>---------------------------------------------------------------------
>Please check that your question  has not already been answered in the
>FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>
>
>To unsubscribe, e-mail:     <co...@xml.apache.org>
>For additional commands, e-mail:   <co...@xml.apache.org>
>
>  
>




******************************************************************************
The information in this e-mail is confidential and may be legally privileged.
It is intended solely for the addressee.  Access to this e-mail by anyone else
is unauthorised.  If you are not the intended recipient, any disclosure,
copying, distribution, or any action taken or omitted to be taken in reliance
on it, is prohibited and may be unlawful.
Please note that emails to, from and within RTÉ may be subject to the Freedom
of Information Act 1997 and may be liable to disclosure.
******************************************************************************
