Posted to users@cocoon.apache.org by Tobia Conforto <to...@linux.it> on 2007/09/17 13:33:52 UTC
Re: Parsing HTML entities
Andrew Stevens wrote:
> Tobia Conforto writes:
> > I cannot change this data source component, therefore I need a
> > transformer to examine every text node in the stream, split it at the
> > fake "<br>" tags, substitute them with <xhtml:br/> elements, and
> > replace every escaped HTML entity with the relevant Unicode character.
>
> We have something similar in our application; I arrange the early part
> of the pipeline so that the escaped HTML appears within a unique
> element e.g.
>
> <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>
>
> pass it through the html transformer
>
> <map:transform type="html">
> <map:parameter name="tags" value="some_escaped_html"/>
> </map:transform>
>
> and follow that by a small xsl transformation to strip out the
> some_escaped_html elements and the html & body elements that JTidy
> inserts.
>
> Net result, the same SAX stream but with the HTML unescaped and
> cleaned up so it's well-formed again.
Thank you.
After extensive testing, it turns out this is the best method.
It works for any kind of malformed HTML and is efficient enough,
provided I put <some_escaped_html> tags only where they are needed.
Tobia
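
For readers following along, the "small xsl transformation" mentioned in the quoted message might look roughly like the sketch below. It is not the stylesheet either poster actually used. It assumes the marker element is named some_escaped_html and has no namespace, that the html transformer leaves html and body wrappers directly inside it, and that a head element may also appear there. It matches the wrappers by local name because the namespace they arrive in is not stated in the thread.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap the marker element and the html/body wrappers found
       inside it, keeping only their child nodes. -->
  <xsl:template match="some_escaped_html
                     | some_escaped_html/*[local-name()='html']
                     | some_escaped_html/*[local-name()='html']/*[local-name()='body']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

  <!-- Assumption: discard a head element if the parser adds one. -->
  <xsl:template match="some_escaped_html/*[local-name()='html']/*[local-name()='head']"/>

</xsl:stylesheet>
```

Saved under a hypothetical name such as strip-escaped-html.xsl, it would go in the pipeline as an ordinary <map:transform src="strip-escaped-html.xsl"/> step right after the html transformer. Scoping the html and body patterns to children of some_escaped_html keeps the page's real html and body elements intact.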