Posted to users@cocoon.apache.org by Tobia Conforto <to...@linux.it> on 2007/09/17 13:33:52 UTC

Re: Parsing HTML entities

Andrew Stevens wrote:
> Tobia Conforto writes:
> > I cannot change this data source component, therefore I need a
> > transformer to examine every text node in the stream, split it at the
> > fake "<br>" tags, substitute them with <xhtml:br/> elements, and
> > replace every escaped HTML entity with the relevant Unicode character.
>
> We have something similar in our application; I arrange the early part
> of the pipeline so that the escaped HTML appears within a unique
> element e.g.
>
>   <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>
>
> pass it through the html transformer
>
>   <map:transform type="html">
>     <map:parameter name="tags" value="some_escaped_html"/>
>   </map:transform>
>
> and follow that by a small xsl transformation to strip out the
> some_escaped_html elements and the html & body elements that JTidy
> inserts.
>
> Net result, the same SAX stream but with the HTML unescaped and
> cleaned up so it's well-formed again.

Thank you.
After extensive testing, it turns out this is the best method.

It works for any kind of malformed HTML and is efficient enough,
provided I put <some_escaped_html> tags only where they are needed.
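
For the archives, the clean-up step Andrew describes could look roughly
like the stylesheet below. This is only a minimal XSLT 1.0 sketch, not
his actual stylesheet. It assumes the wrapper element is in no namespace
and that JTidy emits an html element containing a body (and possibly a
head) directly inside it; the elements are matched by local name because
I am not certain which namespace they arrive in.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not matched below unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap the marker element, keeping whatever it contains. -->
  <xsl:template match="some_escaped_html">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Assumption: JTidy wraps the parsed fragment in html (and body).
       Descend only into body, which also drops any generated head. -->
  <xsl:template match="some_escaped_html/*[local-name()='html']">
    <xsl:apply-templates select="*[local-name()='body']"/>
  </xsl:template>

  <!-- Unwrap body, keeping the cleaned-up markup and text inside it. -->
  <xsl:template match="some_escaped_html/*[local-name()='html']/*[local-name()='body']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

</xsl:stylesheet>
```

In the sitemap it would go right after the html transformer, e.g.
<map:transform src="strip-wrappers.xsl"/>, where the file name is just a
placeholder. If JTidy wraps bare text in a paragraph element, that
element will survive this stylesheet and needs its own template.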


Tobia
