
Posted to user@commons.apache.org by Balazs Somogyi <ba...@FATHOMTECHNOLOGY.com> on 2003/02/26 15:34:36 UTC

digester + DOM

Hi,
 
Is it possible to feed digester with already parsed XML (actually
XHTML)?
I'm using JTidy to parse HTML and would like to extract some of its
elements, but I don't want to traverse the tree manually.
 
Thanks in advance for your help,
Balazs

Re: [OT] SAX & DOM was: digester + DOM

Posted by robert burrell donkin <ro...@blueyonder.co.uk>.

On Wednesday, February 26, 2003, at 04:08 PM, Erik Price wrote:
> James Strachan wrote:
>> Rather than using JTidy to parse HTML (which makes a DOM) you could use
>> NekoHTML which is-a SAX parser that can handle HTML. Then you don't need to
>> use a DOM.
>
> Sorry to hijack a thread like this, but I was curious -- if you're 
> building an in-memory representation of an XML document, is there still a 
> compelling reason to use a SAX parser?  Or should you just use DOM in 
> that case?

james can probably give you a pretty definitive answer to this question 
but here's my two penneth.

i think the answer depends on what in-memory 
representation you want. DOM is a generic representation: different kinds 
of xml (eg. having different schemas) are represented using the same 
objects. this may be good or bad depending on the circumstances. if you're 
interested in general xml then a general representation is best. but 
there's more than DOM out there. there are several general representations 
(eg. dom4j) which offer more java-friendly APIs.

even when you're dealing with general representations, SAX (and therefore 
digester) can have advantages over DOM. with SAX it is easy to filter so 
that only the part of the object model you're interested in is created. 
digester has a rule that creates partial DOM object models which can be 
used in this way.
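
To make the filtering idea concrete, here is a minimal sketch that uses only 
the SAX and DOM classes bundled with the JDK. It is not Digester's own rule; 
the class and method names (PartialDomDemo, SubtreeBuilder, parseSubtree) are 
made up for illustration. The handler ignores every SAX event until it sees 
the first element with the requested name, and builds DOM nodes only for 
that subtree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class PartialDomDemo {

    /** Builds DOM nodes only while inside the first element named targetName. */
    static final class SubtreeBuilder extends DefaultHandler {
        private final String targetName;
        private final Document doc;
        private Node current;   // node being filled in; null while outside the subtree
        private boolean done;   // true once the first matching subtree is complete

        SubtreeBuilder(String targetName, Document doc) {
            this.targetName = targetName;
            this.doc = doc;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (current == null) {
                if (done || !qName.equals(targetName)) {
                    return;     // outside the part we care about: create nothing
                }
                current = doc;  // the matching element becomes the document element
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;
            }
            current = current.getParentNode();
            if (current.getNodeType() == Node.DOCUMENT_NODE) {
                current = null; // left the subtree; ignore the rest of the input
                done = true;
            }
        }
    }

    /** Parses xml with SAX and returns a Document holding only the chosen subtree. */
    static Document parseSubtree(String xml, String targetName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new SubtreeBuilder(targetName, doc));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><ul id='nav'><li>a</li><li>b</li></ul><p>skipped</p></body></html>";
        Document nav = parseSubtree(xhtml, "ul");
        System.out.println(nav.getDocumentElement().getElementsByTagName("li").getLength());
    }
}
```

Nothing outside the chosen element is ever allocated as a DOM node, which is 
where the saving over a full DOM parse comes from. The sketch matches on the 
qualified name and assumes a parser that is not namespace-aware (the JDK 
default).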

on the other hand, a very common use case is having a particular object 
model in mind which is represented by strongly typed java beans. in this 
case, though the mapping is to an in-memory object model, there is a 
considerable performance benefit (both speed and memory) in using SAX 
rather than DOM. there are a number of technologies (eg. castor, JAXB, 
betwixt) which do this - and digester is also commonly used for this 
purpose.
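
As a rough illustration of mapping straight from SAX events to typed beans, 
the sketch below collects the links in an XHTML fragment into a list of Link 
objects without ever building a tree. It uses only JDK classes; Link, 
LinkHandler and parseLinks are hypothetical names, and tools like digester 
or betwixt would replace the hand-written handler with declarative rules.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxToBeansDemo {

    /** Hypothetical strongly typed bean for one <a href="..."> element. */
    static final class Link {
        final String href;
        final String text;

        Link(String href, String text) {
            this.href = href;
            this.text = text;
        }
    }

    /** Turns SAX events for <a> elements into Link beans. */
    static final class LinkHandler extends DefaultHandler {
        final List<Link> links = new ArrayList<>();
        private String href;
        private StringBuilder text;   // non-null only while inside an <a>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equals(qName)) {
                href = atts.getValue("href");
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);   // text may arrive in several chunks
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equals(qName) && text != null) {
                links.add(new Link(href, text.toString().trim()));
                text = null;
            }
        }
    }

    /** Parses well-formed XHTML and returns one Link per <a> element. */
    static List<Link> parseLinks(String xhtml) {
        try {
            LinkHandler handler = new LinkHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.links;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p>See <a href='/one'>first</a> and <a href='/two'>second</a>.</p>";
        for (Link link : parseLinks(xhtml)) {
            System.out.println(link.href + " -> " + link.text);
        }
    }
}
```

The only objects that survive the parse are the beans themselves, so memory 
use grows with the number of links rather than with the size of the page.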

- robert

[OT] SAX & DOM was: digester + DOM

Posted by Erik Price <ep...@ptc.com>.

James Strachan wrote:
> Rather than using JTidy to parse HTML (which makes a DOM) you could use
> NekoHTML which is-a SAX parser that can handle HTML. Then you don't need to
> use a DOM.

Sorry to hijack a thread like this, but I was curious -- if you're 
building an in-memory representation of an XML document, is there still 
a compelling reason to use a SAX parser?  Or should you just use DOM in 
that case?

I haven't really done much with XML parsing and was wondering about this.

Erik

Re: digester + DOM

Posted by James Strachan <ja...@yahoo.co.uk>.

Rather than using JTidy to parse HTML (which makes a DOM) you could use
NekoHTML which is-a SAX parser that can handle HTML. Then you don't need to
use a DOM.

NekoHTML plugs right into Digester, allowing you to fire Digester rules
straight from the SAX events coming out of the HTML.

http://www.apache.org/~andyc/neko/doc/html/

James
-------
http://radio.weblogs.com/0112098/


Re: digester + DOM

Posted by Janek Bogucki <ya...@studylink.com>.

Hi Balazs,

> From: "Balazs Somogyi" <ba...@FATHOMTECHNOLOGY.com>
> Reply-To: "Jakarta Commons Users List" <co...@jakarta.apache.org>
> Date: Wed, 26 Feb 2003 15:34:36 +0100
> To: <co...@jakarta.apache.org>
> Subject: digester + DOM
> 
> Hi,
> 
> Is it possible to feed digester with an already parsed XML (actually
> XHTML).
> I'm using JTidy to parse HTML and would like to extract some of its
> elements but don't want to traverse manually the tree.
> 
> Thanks in advance for your help,
> Balazs
> 

You could address the elements you want with XPath. This is likely to be a
better approach than serializing the XHTML object tree and having Digester
act on that.
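
If you do want a SAX consumer to act on a tree that is already in memory, 
serializing to text is not the only route. A sketch, using only the JAXP 
transform classes in the JDK: an identity Transformer walks a W3C DOM and 
replays it as SAX events into any ContentHandler. The NameRecorder below is 
a stand-in that just records element names; passing a Digester instance as 
the handler instead is an assumption I have not tested (Digester extends 
the SAX DefaultHandler, but it may expect setup that its own parse methods 
normally do). DomToSaxDemo, replay and parse are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class DomToSaxDemo {

    /** Replays an in-memory DOM as SAX events, with no intermediate text form. */
    static void replay(Document dom, ContentHandler handler) {
        try {
            // A Transformer created without a stylesheet performs an identity transform.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(dom), new SAXResult(handler));
        } catch (TransformerException e) {
            throw new IllegalStateException("could not replay DOM", e);
        }
    }

    /** Stand-in SAX consumer: records the qualified name of each element it sees. */
    static final class NameRecorder extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.add(qName);
        }
    }

    /** Helper for the demo only: stands in for the DOM that JTidy would produce. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        NameRecorder recorder = new NameRecorder();
        replay(parse("<html><body><p>one</p><p>two</p></body></html>"), recorder);
        System.out.println(recorder.names);
    }
}
```

The whole tree still has to exist first, so this keeps the memory cost of 
the DOM; it only saves the serialize-and-reparse step.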

Jakarta has an XPath implementation, JXPath:

    http://jakarta.apache.org/commons/jxpath/index.html

There is also Jaxen (http://jaxen.sourceforge.net/), which can be used to
address W3C DOM, dom4j, JDOM and XOM object trees.
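
For a feel of what addressing elements with XPath looks like, here is a 
small sketch. Note that it uses neither JXPath nor Jaxen but the 
javax.xml.xpath API that ships with later JDKs, so treat it as an 
illustration of the approach rather than of those libraries. XPathDemo, 
selectText and parse are made-up names, and the parsed string stands in for 
the DOM you would get from JTidy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathDemo {

    /** Returns the text content of every node in dom that matches the XPath expression. */
    static List<String> selectText(Document dom, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, dom, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    /** Helper for the demo only: builds a W3C DOM from a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse("<html><body><ul id='nav'><li>Home</li><li>News</li></ul>"
                + "<ul><li>Other</li></ul></body></html>");
        // Select only the list items inside the list whose id is "nav".
        System.out.println(selectText(dom, "//ul[@id='nav']/li"));
    }
}
```

The sample elements are in no namespace. If your tree puts XHTML elements in 
the XHTML namespace, the expression would need prefixes and a namespace 
context to match them.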

-Janek