Posted to dev@shindig.apache.org by Louis Ryan <lr...@google.com> on 2008/10/18 01:23:13 UTC

HTML parsing performance and API

Hi,

I've been working recently on the HTML parsing and rewriting features in
Shindig. One aspect of this work was to investigate the performance of the
Caja DOM parser and evaluate alternatives. I evaluated the Neko HTML parser
(http://nekohtml.sourceforge.net/), which is used in many common
open-source tools and seemed to have decent performance
(http://www.portletbridge.org/saxbenchmark/results.html). It generally gives
significantly better performance than the Caja DOM parser for equivalent
content and seems to do a good job of maintaining document structure and
parsing oddly-formed HTML.

I expanded on johnh's earlier benchmark results to get comparison times
between Caja and Neko. The results below are from parsing an Amazon.com home
page of ~22k. The test accounts for the usual JIT warmup and compilation
phase.

Caja Parse------------------------
Parsing [749 ms total: 24.966666666666665ms/run]

Neko Parse------------------------
Parsing [275 ms total: 9.166666666666666ms/run]
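
For readers who want to reproduce this kind of measurement, here is a minimal
sketch of a warmup-then-time harness. It is not the benchmark code used for
the numbers above. The Parser interface, the class name, and the warmup count
are assumptions made for illustration, and the JDK XML parser is only a
stand-in (it handles well-formed markup, not real-world HTML). The 30 timed
runs are a guess that happens to be consistent with the figures above
(749 / 30 is about 24.97).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseBenchmark {
    /** Hypothetical abstraction: each parser under test is adapted to this. */
    interface Parser {
        Document parse(String content) throws Exception;
    }

    /** Runs untimed warmup passes, then returns total nanoseconds for the timed runs. */
    static long timeParse(Parser parser, String content, int warmupRuns, int timedRuns)
            throws Exception {
        for (int i = 0; i < warmupRuns; i++) {
            parser.parse(content); // give the JIT a chance to compile hot paths
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            parser.parse(content);
        }
        return System.nanoTime() - start;
    }

    /** Formats a result in the same shape as the figures quoted in this thread. */
    static String report(String label, long totalNanos, int timedRuns) {
        long totalMs = totalNanos / 1_000_000;
        return label + " [" + totalMs + " ms total: "
                + ((double) totalMs / timedRuns) + "ms/run]";
    }

    /** Stand-in parser built on the JDK; swap in the parser you want to measure. */
    static Parser jdkXmlParser() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return content -> builder.parse(new InputSource(new StringReader(content)));
    }

    public static void main(String[] args) throws Exception {
        String content = "<html><body><p>hello</p></body></html>";
        int timedRuns = 30; // assumption, see the note above
        long nanos = timeParse(jdkXmlParser(), content, 100, timedRuns);
        System.out.println(report("Parsing", nanos, timedRuns));
    }
}
```

To compare two parsers, adapt each one to Parser and call timeParse with the
same content and run counts.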


The Neko parser actually generates an org.w3c.dom.Document, which I need to
wrap to map to the org.apache.shindig.parse.ParsedHtmlNode. So I added
support to GadgetHtmlParser to also produce a Document object. Here are the
benchmark results for that parse, including implementing a converter from
Caja DOM to w3c DOM.

Caja Parse------------------------
Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]

Neko Parse------------------------
Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]

Some things worth noting. Converting Caja DOM to w3c DOM is low overhead but
the other way around is not (though this may just be poor coding on my
part).
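
To make the cheap direction concrete, here is a minimal sketch of converting
a parser-specific tree into a w3c DOM in a single recursive pass. SimpleNode
is a hypothetical stand-in for the parser's own node type; it is not the
real Caja DOM or ParsedHtmlNode API, and this is not the converter that was
benchmarked.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class TreeToW3cDom {
    /** Hypothetical parser-specific node (NOT a real Caja or Shindig type). */
    static final class SimpleNode {
        final String tagName; // null for text nodes
        final String text;    // null for element nodes
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<SimpleNode> children = new ArrayList<>();

        private SimpleNode(String tagName, String text) {
            this.tagName = tagName;
            this.text = text;
        }

        static SimpleNode element(String tagName) {
            return new SimpleNode(tagName, null);
        }

        static SimpleNode text(String text) {
            return new SimpleNode(null, text);
        }
    }

    /** Builds a new Document whose root mirrors the given element node. */
    static Document toDocument(SimpleNode root) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(convert(root, doc)); // root must be an element, not text
        return doc;
    }

    /** Copies one node and, recursively, its attributes and children. */
    private static Node convert(SimpleNode node, Document doc) {
        if (node.tagName == null) {
            return doc.createTextNode(node.text);
        }
        Element element = doc.createElement(node.tagName);
        for (Map.Entry<String, String> attr : node.attributes.entrySet()) {
            element.setAttribute(attr.getKey(), attr.getValue());
        }
        for (SimpleNode child : node.children) {
            element.appendChild(convert(child, doc));
        }
        return element;
    }
}
```

Each source node is visited once and maps to exactly one DOM node, which is
one plausible reason a conversion in this direction stays cheap.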

There is really no functional advantage to having the ParsedHtmlNode
abstraction over DOM if we can use w3c DOM more cheaply, or with minimal
overhead in the case of Caja, so I propose eliminating these interfaces from
the implementation and altering the rewriter pipeline to consume w3c DOM.
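
As a rough illustration of what a pipeline consuming w3c DOM could look
like, here is a minimal sketch. DomRewriter, DomRewritePipeline, and the
image-proxying rewriter are hypothetical names invented for this example;
they are not existing Shindig interfaces.

```java
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomRewritePipeline {
    /** Hypothetical rewriter contract: mutate the shared Document in place. */
    interface DomRewriter {
        void rewrite(Document doc);
    }

    /** Applies each rewriter, in order, to the same Document. */
    static void apply(Document doc, List<DomRewriter> rewriters) {
        for (DomRewriter rewriter : rewriters) {
            rewriter.rewrite(doc);
        }
    }

    /** Illustrative rewriter: prefixes every img src with a proxy URL. */
    static DomRewriter proxyImages(String proxyPrefix) {
        return doc -> {
            NodeList images = doc.getElementsByTagName("img");
            for (int i = 0; i < images.getLength(); i++) {
                Element img = (Element) images.item(i);
                if (img.hasAttribute("src")) {
                    img.setAttribute("src", proxyPrefix + img.getAttribute("src"));
                }
            }
        };
    }
}
```

With this shape, any parser that can produce an org.w3c.dom.Document can
feed the same rewriters without an intermediate node abstraction.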

Overall, I think the performance of the Neko parser speaks for itself, and I
believe it's the one we should be using in Shindig by default.

-Louis

Re: HTML parsing performance and API

Posted by John Hjelmstad <fa...@google.com>.

On Fri, Oct 17, 2008 at 4:23 PM, Louis Ryan <lr...@google.com> wrote:

> Hi,
>
> I've been working recently on the HTML parsing and rewriting features in
> Shindig. One aspect of this work was to investigate the performance of the
> Caja DOM parser and evaluate alternatives. I evaluated the Neko HTML parser
> (http://nekohtml.sourceforge.net/), which is used in many common
> open-source tools and seemed to have decent performance
> (http://www.portletbridge.org/saxbenchmark/results.html). It generally
> gives significantly better performance than the Caja DOM parser for
> equivalent content and seems to do a good job of maintaining document
> structure and parsing oddly-formed HTML.
>
> I expanded on johnh's earlier benchmark results to get comparison times
> between Caja and Neko. The results below are from parsing an Amazon.com home
> page of ~22k. The test accounts for the usual JIT warmup and compilation
> phase.
>
> Caja Parse------------------------
> Parsing [749 ms total: 24.966666666666665ms/run]
>
> Neko Parse------------------------
> Parsing [275 ms total: 9.166666666666666ms/run]
>
>
> The Neko parser actually generates an org.w3c.dom.Document, which I need to
> wrap to map to the org.apache.shindig.parse.ParsedHtmlNode. So I added
> support to GadgetHtmlParser to also produce a Document object. Here are the
> benchmark results for that parse, including implementing a converter from
> Caja DOM to w3c DOM.
>
> Caja Parse------------------------
> Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]
>
> Neko Parse------------------------
> Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]
>
> Some things worth noting. Converting Caja DOM to w3c DOM is low overhead
> but the other way around is not (though this may just be poor coding on my
> part).
>
> There is really no functional advantage to having the ParsedHtmlNode
> abstraction over DOM if we can use w3c DOM more cheaply or with minimal
> overhead in the case of Caja so I propose eliminating these interfaces from
> the implementation and altering the rewriter pipeline to consume w3c DOM.
>
> Overall, I think the performance of the Neko parser speaks for itself, and I
> believe it's the one we should be using in Shindig by default.


+1 to both of your conclusions. For my part, I'd be happy to see w3c DOM
replace Gadget/ParsedHtmlNode, and Neko replace CajaHtmlParser.

--John


Re: HTML parsing performance and API

Posted by Ben Laurie <be...@google.com>.

[+google-caja-discuss]

On Sat, Oct 18, 2008 at 12:23 AM, Louis Ryan <lr...@google.com> wrote:
> Hi,
>
> I've been working recently on the HTML parsing and rewriting features in
> Shindig. One aspect of this work was to investigate the performance of the
> Caja DOM parser and evaluate alternatives. I evaluated the Neko HTML parser
> (http://nekohtml.sourceforge.net/), which is used in many common
> open-source tools and seemed to have decent performance
> (http://www.portletbridge.org/saxbenchmark/results.html). It generally gives
> significantly better performance than the Caja DOM parser for equivalent
> content and seems to do a good job of maintaining document structure and
> parsing oddly-formed HTML.
>
> I expanded on johnh's earlier benchmark results to get comparison times
> between Caja and Neko. The results below are from parsing an Amazon.com home
> page of ~22k. The test accounts for the usual JIT warmup and compilation
> phase.
>
> Caja Parse------------------------
> Parsing [749 ms total: 24.966666666666665ms/run]
>
> Neko Parse------------------------
> Parsing [275 ms total: 9.166666666666666ms/run]
>
>
> The Neko parser actually generates an org.w3c.dom.Document, which I need to
> wrap to map to the org.apache.shindig.parse.ParsedHtmlNode. So I added
> support to GadgetHtmlParser to also produce a Document object. Here are the
> benchmark results for that parse, including implementing a converter from
> Caja DOM to w3c DOM.
>
> Caja Parse------------------------
> Parsing W3C DOM [292 ms total: 9.733333333333333ms/run]
>
> Neko Parse------------------------
> Parsing W3C DOM [82 ms total: 2.7333333333333334ms/run]
>
> Some things worth noting. Converting Caja DOM to w3c DOM is low overhead but
> the other way around is not (though this may just be poor coding on my
> part).
>
> There is really no functional advantage to having the ParsedHtmlNode
> abstraction over DOM if we can use w3c DOM more cheaply or with minimal
> overhead in the case of Caja so I propose eliminating these interfaces from
> the implementation and altering the rewriter pipeline to consume w3c DOM.
>
> Overall, I think the performance of the Neko parser speaks for itself, and I
> believe it's the one we should be using in Shindig by default.
>
> -Louis
>