Posted to modperl@perl.apache.org by "Sean M. Burke" <sb...@spinn.net> on 2000/07/28 22:08:39 UTC

HTML::Tree and HTML::TreeBuilder + HTML::Parser

I'm jumping into the middle of a discussion of HTML-Tree (which I just
now learned of the existence of) compared to HTML::Parser and
HTML::TreeBuilder.  My apologies if I reveal that I've missed too much
of the discussion to date.

On http://www.best.com/~pjl/software/html_tree/comparison.html,
Paul J. Lucas states:

>HTML Tree is very similar to the HTML::Parser and HTML::TreeBuilder
>Perl modules by Gisle Aas and Michael Chase,

Are you using one of the old old old versions of TreeBuilder that
doesn't have my name on it?  If so, FIE UPON THEE!

Also, be aware that "HTML-Tree" is the name of the CPAN dist that
contains HTML::TreeBuilder, HTML::Element, and two other modules.
Your coming out with another module suite called, homophonically,
"HTML Tree" is just asking for confusion all around.


> [...] except that it: [...]
[I'm taking your points out of order]
> 2. Isn't a strict DTD (Document Type Definition) parser. The goal is
>    to parse HTML files fast, not check for validity. (You should
>    check the validity of your HTML files with other tools before you
>    put them on your web site anyway.) HTML Tree couldn't care less
>    what attributes a given HTML element has just so long as the
>    syntax is correct. This is actually similar to browsers in that
>    both are very permissive in what they accept.

This clearly implies that TreeBuilder /is/ a strict DTD parser, and
that it checks for validity.  It isn't really, and doesn't really, and
I really wish you would clarify this, lest anyone consider it a
categorical misrepresentation of how TreeBuilder works.

TreeBuilder "knows" a few things about HTML, but only the bare minimum
required to be able to produce correct parse trees.  (For example, if
it sees "<p>foo<p>bar" (which is perfectly /valid/ code), it uses
the fact that a p element can't be a child of another p element, and
so closes the first p before opening the second.  How you can do this
any other way is beyond me.)
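As a rough illustration of that one rule, here is a toy sketch in
Python.  It is not TreeBuilder's actual code: the names Node,
TinyTreeBuilder, parse, and outline are made up for the example, and
the only HTML knowledge it has is "a p can't be inside another p".

```python
from html.parser import HTMLParser


class Node:
    """One element in the toy parse tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of child Nodes and text strings


class TinyTreeBuilder(HTMLParser):
    """Toy builder that knows a single rule: p cannot nest inside p."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Look for a still-open p among the ancestors; if there is
            # one, close it implicitly before opening the new p.
            node = self.current
            while node is not None and node.tag != "p":
                node = node.parent
            if node is not None:
                self.current = node.parent
        child = Node(tag, self.current)
        self.current.children.append(child)
        self.current = child

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TinyTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Render the tree as nested tag(...) text for easy inspection."""
    parts = [repr(c) if isinstance(c, str) else outline(c)
             for c in node.children]
    return f"{node.tag}({', '.join(parts)})"


print(outline(parse("<p>foo<p>bar")))
```

Without the check in handle_starttag, the second p would end up as a
child of the first, which is the incorrect parse described above.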


Now, if you /can/ say that the HTML you have coming in is going to be
valid AND has close-tags everywhere, then you don't need to know
anything about HTML -- and, in fact, TreeBuilder has a mode,
$tree->implicit_tags(0), that you can switch it into that bypasses
(nearly?) all that slow and clunky context checking.  (I've basically
never used that mode, since it produces incorrect parses for anything
but perfect code -- which is pretty hard to come by.)

But (while I'm off on this tangent) the REAL way to do this would be
to use XHTML.  If you have control over the quality of the HTML coming
in, then you can just demand that it be XHTML (just run it thru
Raggett's Tidy first), and use an XML parser.  Fast fast fast, because
it needs none of the context checking that's necessary for all
non-XML-like SGML dialects (like HTML).
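A minimal sketch of the XHTML route, in Python with the standard
xml.etree.ElementTree parser (the helper names paragraph_texts and
is_well_formed are made up for this example).  The parser knows
nothing about HTML; it only needs the input to be well-formed, which
is why tag soup like "<p>foo<p>bar" is rejected rather than repaired.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(xhtml):
    """Parse well-formed XHTML with a generic XML parser and return
    the leading text of every p element."""
    root = ET.fromstring(xhtml)
    return [p.text for p in root.iter("p")]


def is_well_formed(text):
    """True if a generic XML parser accepts the text as-is."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


doc = "<html><body><p>foo</p><p>bar<br/></p></body></html>"
print(paragraph_texts(doc))
print(is_well_formed("<p>foo<p>bar"))
```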

> 3. Offers simple conditional and looping mechanisms assisting in the
>    generation of dynamic content.

I consider that well outside the scope of what HTML::TreeBuilder is
for, so I think it's fine if you do that and I don't.


If I did want that kind of thing, I'd do it on top of XHTML (with the
conditional and loopy things in PIs); or I could do the same in HTML,
and have a 'preprocess' method in TreeBuilder that would traverse the
tree and run whatever the PIs say to run.

In fact, people can even now do this for themselves by just
subclassing TreeBuilder and overriding whatever method it is that
catches PIs, and having loop construct fun in there.  And the new
TreeBuilder will provide a better mechanism for that sort of thing,
which will require no subclassing.

(For various reasons, I'm thinking of deprecating use of TreeBuilder
as a base class, and then explicitly making it unsubclassable; I
REALLY don't like breaking things for other people -- but it's my
experience that very few people have actually been subclassing it.)


> 1. Is several times faster. HTML Tree owes its speed to two things:
>    using mmap(2) to read the HTML file bypassing conventional I/O and
>    buffering, and being written entirely in C++ as opposed to Perl

I haven't benchmarked anything here, but I'll note that my first
priority in my rewrite of HTML::TreeBuilder was that it actually
produce correct parse trees; compatibility with pre-XS versions of
HTML::Parser has also been a priority; and speed, admittedly, is a
distant third.

(Somewhere in there is the new ability to have XML::Twig-like
callbacks, useful in processing large HTML documents; I'm adding that
feature now; mercifully, it doesn't really interfere with the other
priorities of correctness, compatibility, and speed!  I owe a million
thanks to Michel Rodriguez (XML::Twig author) for the idea.  Like all
great ideas, it's obvious -- in retrospect!) 
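For readers who haven't seen the Twig-style idea, here is a small
sketch of it in Python.  It is only an illustration of the general
pattern, not the interface TreeBuilder will have: the function name
process and the handlers mapping are assumptions for the example, and
it reads well-formed XML via the standard library's iterparse.  Each
handler runs as soon as its element is complete, and the element is
then emptied so a large document never has to sit in memory whole.

```python
import io
import xml.etree.ElementTree as ET


def process(source, handlers):
    """Call handlers[tag](element) as each matching element is closed,
    then discard that element's contents to keep memory use low."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        handler = handlers.get(elem.tag)
        if handler is not None:
            handler(elem)
            elem.clear()  # drop text and children once handled


items = []
data = io.BytesIO(b"<doc><item>a</item><note>x</note><item>b</item></doc>")
process(data, {"item": lambda e: items.append(e.text)})
print(items)
```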


But speaking of speed:

I'm about to put out a new version of TreeBuilder and Element that
should exhibit improved speed (besides some fancy new features).

I am also considering some speed tweaks that should make TreeBuilder
even faster for people using the XS versions of HTML::Parser -- by
telling the Parser object not to use the derived-class interface, and
instead specifying callbacks that do about the same thing.

(I think I could even optimize things a bit more by scrapping the
whole existing interface and having the specified callbacks be
closures where variables for the top of the tree and a few other temp
variables are under closure -- that'd avoid having to constantly
allocate $self and stuff.  But this would work only under XS versions
of Parser.  I may well just make a version of TreeBuilder that
compiles itself one way for XS versions of Parser, and another way for
pre-XS versions -- as the differences would be minor and systematic.
I'm also considering commenting out all the print "..." if $Debug > 2
statements that are practically every other instruction in
TreeBuilder.  Those are indispensable for any kind of development or
debugging, but they do exact a minor performance hit for users.)

I'm currently banging out my next TPJ article (greatly involving
TreeBuilder, by the way), and once that's done I'll try to release the
new HTML-Tree (TreeBuilder, Element, et al) containing many of the
features discussed above.  Look for it in CPAN hopefully in the next
two weeks.

-- 
Sean M. Burke    sburke@cpan.org    http://www.spinn.net/~sburke/