Posted to users@cocoon.apache.org by Tobia Conforto <to...@linux.it> on 2007/08/31 15:24:58 UTC

Parsing HTML entities

Hello

I have a data source from which I get SAX text nodes into my pipeline
that contain escaped HTML entities and <br> tags.  In Java syntax:

"Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"

or, in XML syntax:

Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer

As you can see, the entities and <br> tags are escaped and part of the
text node.

I cannot change this data source component, therefore I need a
transformer to examine every text node in the stream, split it at the
fake "<br>" tags, substitute them with <xhtml:br/> elements, and
replace every escaped entity with the relevant Unicode character.

I tried doing it with the Parser transformer, but it's too slow.

I tried using the HTML transformer, but I couldn't get it to work.


My question is: what do you suggest I use on the Java side?

Is there anything like PHP's html_entity_decode() available somewhere
in a library that Cocoon is already using, that can parse and convert
HTML 4.0 entities in a single pass over the string?


Tobia

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Parsing HTML entities

Posted by Tobia Conforto <to...@linux.it>.
Never mind, I solved it "by hand".

I wrote a Python script that takes a list of HTML entities and generates
a huge tree of nested switch statements:
switch () { case: switch () { case: switch () { case: ... } } }

The generated Java code goes through a char[] in a single pass, and when
it recognizes an entity it pushes the associated Unicode char into the
SAX stream instead of the chars composing the entity.

It's pretty brutal and it produces a 36 kB class file, but it's the
fastest thing that could possibly do the job, short of writing a C
extension! The pattern transformer took 800 ms on some data where mine
takes 2 ms!

If anybody is interested, I can post or email the code.
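[Editor's note: the generated nested-switch code isn't reproduced in the thread, but the single-pass idea can be sketched as below. This is a simplified, hypothetical version: it uses a small HashMap lookup instead of a generated switch tree, covers only a handful of entities, and ignores numeric character references.]

```java
import java.util.HashMap;
import java.util.Map;

public class EntityDecoder {
    // A few HTML 4.0 entities for illustration; the full table has ~250 entries.
    private static final Map<String, Character> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", '&');
        ENTITIES.put("lt", '<');
        ENTITIES.put("gt", '>');
        ENTITIES.put("quot", '"');
        ENTITIES.put("nbsp", '\u00a0');
        ENTITIES.put("mdash", '\u2014');
    }

    // Single pass over the input: copy characters through, and when an
    // "&name;" sequence matches a known entity, emit its Unicode character
    // instead of the characters composing the entity.
    public static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0, n = in.length();
        while (i < n) {
            char c = in.charAt(i);
            if (c == '&') {
                int semi = in.indexOf(';', i + 1);
                // Entity names are short; the length guard avoids scanning
                // to a far-away semicolon that can't belong to an entity.
                if (semi > i + 1 && semi - i <= 10) {
                    Character u = ENTITIES.get(in.substring(i + 1, semi));
                    if (u != null) {
                        out.append(u.charValue());
                        i = semi + 1;
                        continue;
                    }
                }
            }
            // Not a recognized entity: pass the character through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Unrecognized sequences such as "&bogus;" are passed through untouched, which matches the forgiving behavior you want for real-world data.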


Joerg Heinicke wrote:
> That's one of the rare cases where I consider
> <xsl:text disable-output-escaping="yes"> a valid approach

Yes, that was the first thing I tried, but I discarded it as it was
causing more problems than it solved.


Tobia



Re: Parsing HTML entities

Posted by Tobia Conforto <to...@linux.it>.
Andrew Stevens wrote:
> Tobia Conforto writes:
> > I cannot change this data source component, therefore I need a
> > transformer to examine every text node in the stream, split it at the
> > fake "<br>" tags, substitute them with <xhtml:br/> elements, and
> > replace every escaped HTML entity with the relevant Unicode character.
>
> We have something similar in our application; I arrange the early part
> of the pipeline so that the escaped HTML appears within a unique
> element e.g.
>
>   <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>
>
> pass it through the html transformer
>
>   <map:transform type="html">
>     <map:parameter name="tags" value="some_escaped_html"/>
>   </map:transform>
>
> and follow that by a small xsl transformation to strip out the
> some_escaped_html elements and the html & body elements that JTidy
> inserts.
>
> Net result, the same SAX stream but with the HTML unescaped and
> cleaned up so it's well-formed again.

Thank you.
After extensive testing, it turns out this is the best method.

It works for any kind of malformed HTML and is efficient enough,
provided I put <some_escaped_html> tags only where they are needed.
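[Editor's note: pieced together from the snippets quoted above, the sitemap fragment might look like the following. The match pattern, generator source, and stylesheet name are made up for illustration.]

```xml
<map:match pattern="data">
  <!-- Generator whose text nodes contain the escaped HTML -->
  <map:generate src="cocoon:/raw-data"/>
  <!-- Re-parse the text inside the marker elements as HTML -->
  <map:transform type="html">
    <map:parameter name="tags" value="some_escaped_html"/>
  </map:transform>
  <!-- Strip the marker elements and the html/body wrappers JTidy adds -->
  <map:transform src="strip-escaped-html.xsl"/>
  <map:serialize type="xml"/>
</map:match>
```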


Tobia



RE: Parsing HTML entities

Posted by Andrew Stevens <at...@hotmail.com>.
> From: joerg.heinicke@gmx.de
> Date: Fri, 31 Aug 2007 14:06:59 +0000
>
> Tobia Conforto <tobia.conforto <at> linux.it> writes:
>
>> I have a data source from which I get SAX text nodes into my pipeline
>> that contain escaped HTML entities and <br> tags.  In Java syntax:
>>
>> "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
>>
>> or, in XML syntax:
>>
>> Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
>>
>> As you can see, the entities and <br> tags are escaped and part of the
>> text node.
>>
>> I cannot change this data source component, therefore I need a
>> transformer to examine every text node in the stream, split it at the
>> fake "<br>" tags, substitute them with <xhtml:br/> elements, and
>> replace every escaped entity with the relevant Unicode character.
>
> That's one of the rare cases where I consider <xsl:text
> disable-output-escaping="yes"> a valid approach [1]. I don't know if there is
> something comparable directly on the Java side.

Unless I'm mistaken, doing that on his example would result in an invalid
document, as there's no matching </br> element...?  It would be okay if it
could be guaranteed that the included text is nice well-formed XHTML, but if
it's plain old HTML then it sounds to me more like a job for the JTidy- or
NekoHTML-based HTML transformers.

We have something similar in our application; I arrange the early part of the
pipeline so that the escaped HTML appears within a unique element, e.g.

  <some_escaped_html>Lorem ipsum &lt;br&gt; dolor</some_escaped_html>

pass it through the html transformer

  <map:transform type="html">
    <map:parameter name="tags" value="some_escaped_html"/>
  </map:transform>

and follow that with a small XSL transformation to strip out the
some_escaped_html elements (and the html and body elements that JTidy inserts)

  <xsl:template match="some_escaped_html">
    <xsl:apply-templates select="html/body/*"/>
  </xsl:template>

plus the usual "passthrough" templates for all other nodes.

Net result: the same SAX stream, but with the HTML unescaped and cleaned
up so it's well-formed again.
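[Editor's note: the "passthrough" templates mentioned here are usually the standard XSLT identity transform, which copies every node not matched by a more specific template straight through.]

```xml
<!-- Identity transform: copy attributes and nodes through unchanged -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>
```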


Andrew.





Re: Parsing HTML entities

Posted by Joerg Heinicke <jo...@gmx.de>.
Tobia Conforto <tobia.conforto <at> linux.it> writes:

> I have a data source from which I get SAX text nodes into my pipeline
> that contain escaped HTML entities and <br> tags.  In Java syntax:
> 
> "Lorem ipsum &mdash; dolor sit amet. <br> Consectetuer"
> 
> or, in XML syntax:
> 
> Lorem ipsum &amp;mdash; dolor sit amet. &lt;br&gt; Consectetuer
> 
> As you can see, the entities and <br> tags are escaped and part of the
> text node.
> 
> I cannot change this data source component, therefore I need a
> transformer to examine every text node in the stream, split it at the
> fake "<br>" tags, substitute them with <xhtml:br/> elements, and
> replace every escaped entity with the relevant Unicode character.

That's one of the rare cases where I consider <xsl:text
disable-output-escaping="yes"> a valid approach [1]. I don't know if there is
something comparable directly on the Java side.
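[Editor's note: applied to text nodes, that approach would look something like the sketch below. disable-output-escaping is a serializer feature, so it only takes effect when the pipeline actually serializes the result; that caveat is part of why it's normally a last resort.]

```xml
<!-- Re-activate the escaped markup in every text node at serialization time -->
<xsl:template match="text()">
  <xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
```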

Joerg

[1] http://www.w3.org/TR/xslt#disable-output-escaping

