Posted to dev@xalan.apache.org by Cory Isaacson <ci...@compuflex.com> on 2002/10/01 03:13:19 UTC

Using DTMDocumentImpl

We have a situation where we need a very memory-efficient, lightweight DOM
capability. I'm curious to know if the Xalan DTM would work for our needs.
The only things we do are create elements with namespaces, modify the value
nodes on occasion, and specify some attributes. The rest of our application
is handled through Xalan XSLT conversions, which are very efficient, but
because we are building the DOM in memory (using Xerces), it's taking a lot
more memory than we would like (and much more than the actual data sizes we
are working with).
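
For concreteness, the kind of DOM building involved looks roughly like the
sketch below (the element and namespace names are invented for illustration;
the calls are just the standard JAXP / DOM Level 2 API):

    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class BuildSampleDom {
        public static Document build() throws Exception {
            // Namespace-aware DOM, since the XSLT conversions rely on namespaces.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            // Create an element in a namespace and set an attribute on it.
            Element order = doc.createElementNS("urn:example:orders", "ord:order");
            order.setAttributeNS(null, "id", "12345");
            doc.appendChild(order);

            // A value node that is occasionally modified later, without
            // rebuilding the rest of the tree.
            Element total = doc.createElementNS("urn:example:orders", "ord:total");
            total.appendChild(doc.createTextNode("19.95"));
            order.appendChild(total);
            total.getFirstChild().setNodeValue("24.95");

            return doc;
        }
    }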

I know that the DTM is not a publicly supported object, but from perusing
the JavaDoc it looks like it does most if not all of what we need.

Any suggestions would be very much appreciated.

Thanks,

Cory

Cory Isaacson
President & CTO
Compuflex International
(818) 884-1168
cisaacson@compuflex.com
www.compuflex.com


RE: Using DTMDocumentImpl

Posted by Joseph Kesselman <ke...@us.ibm.com>.
CachedXPathAPI, as the name says, caches DTMs for the documents you've run 
XPaths against. It _shouldn't_ be allocating a new DOM2DTM for each 
DOMSource... but that would be worth checking.

Of course DOM2DTM isn't exactly lightweight; it's most of a DTM built 
alongside your DOM. We're hoping the experimental DOM2DTM2 code might 
provide a lighter-weight alternative.

(Note that there's a problem if you're changing your DOM between XPath 
queries; DOM2DTM assumes the source DOM is not being altered, and 
violating that expectation can break Xalan. Changing the content of text 
nodes is _probably_ safe, but most other changes wouldn't be, and would 
probably require flushing the DTM from the cache and rebuilding it from 
scratch... and I think our only flush mechanism today is to switch to a 
new instance of CachedXPathAPI. Another of the reasons we're experimenting 
with the DOM2DTM2 approach is that it should be more tolerant of 
alterations in the DOM between XPath scans.)
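
In code, the pattern I have in mind is roughly the sketch below (the wrapper
class and its method names are mine, not part of Xalan; only
org.apache.xpath.CachedXPathAPI itself is real):

    import org.apache.xpath.CachedXPathAPI;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class QueryHelper {
        // Reuse one instance across many XPaths against the same, unmodified
        // DOM, so the DOM2DTM built for that document is built only once.
        private CachedXPathAPI xpath = new CachedXPathAPI();

        public NodeList select(Document doc, String expr) throws Exception {
            return xpath.selectNodeList(doc, expr);
        }

        public void afterDomChanged() {
            // The only flush mechanism today: drop the old instance so its
            // cached DTMs are discarded and rebuilt on the next query.
            xpath = new CachedXPathAPI();
        }
    }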

______________________________________
Joe Kesselman  / IBM Research

RE: Using DTMDocumentImpl

Posted by Cory Isaacson <ci...@compuflex.com>.
Joseph,

This is very helpful, and I really appreciate the time you took to answer my
question.

We are using Xerces, which on looking further may not be our biggest
culprit.

We do a lot of XPath queries of our documents, and recently we switched from
the standard Xalan XPathAPI object to creating a persistent instance of
CachedXPathAPI. I have not totally narrowed it down yet, but it appears this
object may be chewing up a lot of memory (we're using about 500K per call,
which includes creating a Xerces DOM and the CachedXPathAPI). Do you
think that's possible? If so, I can just go back to the static call to
XPathAPI.
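
A crude way to narrow this down might be something like the sketch below
(illustrative only; System.gc() is just a hint to the JVM, so the heap
numbers are approximate rather than exact):

    public class RoughHeapCheck {
        static long usedHeap() {
            Runtime rt = Runtime.getRuntime();
            rt.gc();                                 // best-effort hint only
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) throws Exception {
            long before = usedHeap();
            // ... build the Xerces DOM and run the CachedXPathAPI queries here ...
            long after = usedHeap();
            System.out.println("approx bytes retained: " + (after - before));
        }
    }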

Thanks,

Cory



RE: Using DTMDocumentImpl

Posted by Cory Isaacson <ci...@compuflex.com>.
Joseph,

If you do think the XPathAPI is not efficient, we could also just use DOM
methods to find elements, such as getElementById, etc. There aren't too many
places we do this, so I'm open to whichever approach you recommend.

Thanks,

Cory



RE: Using DTMDocumentImpl

Posted by Joseph Kesselman <ke...@us.ibm.com>.
>If you think the DTM would be far more efficient

Depends, of course, on which DOM implementation you're comparing it to. 

The "shoehorned" DTMDocumentImpl was intended to pack a node's core data 
(not counting strings) into just four integers (plus some amortized 
overhead).  That's certainly more compact than most straightforward DOM 
implementations, which generally use an object per node -- even a 
completely empty Object consumes several times that space, last I checked, 
and then you have to add the member fields. 
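
To give a feel for the technique (a toy illustration only, not
DTMDocumentImpl's actual field layout), the packing looks something like:

    // Each node occupies four ints; small fields share an int via shift/mask.
    // Fixed capacity just to keep the sketch short.
    public class PackedNodes {
        private static final int SLOTS = 4;
        private static final int TYPE_BITS = 4;       // element, text, attr, ...
        private static final int TYPE_MASK = (1 << TYPE_BITS) - 1;

        private int[] table = new int[SLOTS * 1024];
        private int count = 0;

        int addNode(int type, int nameIndex, int parent, int firstChild, int nextSibling) {
            int node = count++;
            int base = node * SLOTS;
            table[base]     = (nameIndex << TYPE_BITS) | (type & TYPE_MASK);
            table[base + 1] = parent;
            table[base + 2] = firstChild;
            table[base + 3] = nextSibling;
            return node;                              // the integer "handle"
        }

        int getType(int node)        { return table[node * SLOTS] & TYPE_MASK; }
        int getNameIndex(int node)   { return table[node * SLOTS] >>> TYPE_BITS; }
        int getParent(int node)      { return table[node * SLOTS + 1]; }
        int getFirstChild(int node)  { return table[node * SLOTS + 2]; }
        int getNextSibling(int node) { return table[node * SLOTS + 3]; }
    }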

On the other hand, there are definite downsides. Part of that compression 
is achieved by de-optimizing certain operations -- there's no link to the 
previous sibling; to find it we look at the parent's first-child and then 
scan next-siblings until just before our starting node. And the tests and 
mask-and-shift operations needed to extract the bitfields from those 
integers also consume some cycles. We tried to avoid de-optimizing the 
operations most important to XSLT, but others are on their own.
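
In terms of the toy layout sketched above, that de-optimized previous-sibling
lookup amounts to something like this (again illustrative, not the real code):

    // Added to the PackedNodes sketch: there is no stored previous-sibling
    // link, so walk the parent's child list and stop just before our node.
    int getPreviousSibling(int node) {
        int parent = getParent(node);
        if (parent < 0) return -1;                    // no parent: no siblings
        int prev = -1;                                // -1 if node is the first child
        for (int c = getFirstChild(parent); c != node && c >= 0; c = getNextSibling(c)) {
            prev = c;
        }
        return prev;
    }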

As I say, I have no idea what the current status of DTMDocumentImpl is; I 
don't think we've actually tried running it in a Very Long Time. Getting 
it running at all may be the first step...


Alternatively, there's the current DTM code -- DTMDefaultBase and the 
SAX2DTM and DOM2DTM classes derived from it. This isn't as compact, but on 
the other hand it isn't as slow. Rather than a single table of 
four-integer chunks and extracting subfields via shift-and-mask, it uses a 
separate table for each "column" of data... and it adds a few columns such 
as previous-sibling. DTMDefaultBase also contains a lot of support 
specifically for Xalan's needs.  I'd call this "more efficient" rather 
than "far more efficient" -- probably a factor of 2 rather than a factor 
of 3-4. (Note that this only refers to node size; as mentioned earlier, 
strings aren't compacted... but we do try to share single instances when a 
string is used repeatedly, and our FastStringBuffer is used to avoid the 
overhead of an object per string.)
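
Schematically (again a toy, not DTMDefaultBase's real tables), the
column-per-field layout looks like:

    // One parallel array per piece of node data, so no shift-and-mask is
    // needed, and extra columns (like previous-sibling) are cheap to add.
    // Fixed capacity just to keep the sketch short.
    public class ColumnNodes {
        private static final int CAPACITY = 1024;
        int[] type            = new int[CAPACITY];
        int[] parent          = new int[CAPACITY];
        int[] firstChild      = new int[CAPACITY];
        int[] nextSibling     = new int[CAPACITY];
        int[] previousSibling = new int[CAPACITY];    // the extra column
        int count = 0;

        int addNode(int nodeType, int parentNode, int prevSibling) {
            int node = count++;
            type[node] = nodeType;
            parent[node] = parentNode;
            firstChild[node] = -1;
            nextSibling[node] = -1;
            previousSibling[node] = prevSibling;
            return node;
        }

        int getPreviousSibling(int node) { return previousSibling[node]; }   // now O(1)
    }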

DTMDefaultBase will probably handle larger documents than DTMDocumentImpl, 
if that matters to you... at least, it will do so when teamed up with a 
DTMManager which understands the overflow-addressing scheme, such as 
DTMManagerDefault.


NOTE: DTM has been biased toward the XPath view of the document rather 
than the DOM view. The current DTM in particular tends to elide details 
which XPath doesn't care about. If you need something that captures all 
the details of a DOM, such as Entity Reference Nodes or the Document Type 
tree or exactly how text and <![CDATA[]]> have been mixed within a single 
element, DTM as it stands will probably not meet your needs. 

Similarly, DTM is really designed to be an immutable model. As noted 
above, changing a single string value is probably possible but may have to 
account for some interesting interactions. Changing the structure is not 
something either DTMDefaultBase or DTMDocumentImpl are currently able to 
handle, though there's a minor step in that direction in the RTF pruning 
code. DOM2DTM2 hopes to be more flexible in that regard... between 
stylesheet/XPath passes, not during them.


Experience with DTM has been mixed. On the one hand, it is a more compact 
model. On the other hand, you may give up a lot of the power of your 
compiler and debugger to help you analyse your application; you can no 
longer just expand an object to see what a node contains, and you can't 
count on datatypes to help you distinguish between DTM Handles (the 
integers the application uses), DTM IDs (the integers DTM uses), and other 
integers. In the "do as I say, not as I did" department, I would strongly 
recommend you adopt a naming convention to help keep those value types 
from getting tangled. 
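
For example, something along these lines (the helper and the suffix
convention are just a suggestion; only the org.apache.xml.dtm.DTM interface
and DTM.NULL are Xalan's):

    import org.apache.xml.dtm.DTM;

    public class HandleNaming {
        // Suggested convention: since every value is an int, put its kind in
        // the variable name.
        //   ...Handle   -> node handles that application code passes around
        //   ...Identity -> DTM-internal node numbers
        //   ...Index    -> positions in string pools or other side tables
        static int countChildren(DTM dtm, int parentHandle) {
            int count = 0;
            for (int childHandle = dtm.getFirstChild(parentHandle);
                 childHandle != DTM.NULL;
                 childHandle = dtm.getNextSibling(childHandle)) {
                count++;
            }
            return count;
        }
    }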


I know, that's a lot of "it depends" -- but that's the best answer I can 
give you; the choice of data structure really does depend on what your 
needs are. Hope it helps, anyway. Good luck...

______________________________________
Joe Kesselman  / IBM Research

RE: Using DTMDocumentImpl

Posted by Cory Isaacson <ci...@compuflex.com>.
Joseph,

The DOM2DTM2 may help us in some situations, but our main problem now is the
amount of memory the DOM takes, and I was hoping to find something that
uses fewer resources.

If you think the DTM would be far more efficient, I'd be willing to have one
of our developers look at adding the code to modify the content of a value
node or attribute.

Let me know what you think, and based on your input we'll be considering our
options.

Thanks,

Cory



Re: Using DTMDocumentImpl

Posted by Joseph Kesselman <ke...@us.ibm.com>.
DTMDocumentImpl was intended to be a port of the "ultra-compressed" 
first-draft version of DTM to the Xalan 2.0 environment. As far as I know, 
we never finished porting it; the code is in an unstable state. If someone 
can invest the cycles to try to bring it up to full operation, it'd be an 
interesting space/performance comparison point.

Note that DTM has no ability to "modify the value" of a node. It's 
strictly a write-once-read-many API. Theoretically, changing the content 
of an existing attribute or text node would not be hard to add... but 
practically, that has ugly interactions with issues like namespace lookup 
and ID nodes and such. I'd hesitate to go that route.


Alternatively: If your concern is the overhead of DTM on top of the DOM 
(the DOM2DTM layer) rather than the size of your source DOM per se, I just 
checked in a VERY early draft of a "thinner" adapter (DOM2DTM2) over on 
the XSLT20 branch. It's specifically intended to avoid replicating so much 
of the DOM structure, and to better tolerate repeatedly running 
stylesheets over the same source DOM. In its current form it is rather 
slow, but I hope to improve that.

______________________________________
Joe Kesselman  / IBM Research