Posted to c-dev@xerces.apache.org by Samar Lotia <sl...@siebel.com> on 2002/04/30 05:46:52 UTC

RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

I don't know enough about the implementations of DOMString or the IDOM tree
model, so I will phrase my comments partly as questions.
 
Is the internal implementation also going to use this lightweight string
class to maintain its strings? If not, how would we guarantee that nobody
is going to change the memory that this DOMString is pointing to? If we
cannot guarantee this, then the semantics of this 'DOMString' may be
confusing to users.
 
Also, what about 'read-only' thread safety? I know this is not necessarily a
design goal here, but it sure would be nice to have. Has any thought been
given to this? Or is the current thinking that each thread must be looking
at its own copy of the DOM?
 
Again, from what I have understood of Lenny's handle/body implementation we
do not have 'read-only' thread safety. Consider the case where two threads
are pointing to the same element from a given document tree. As both of them
destruct their handle object, both handles will attempt to decrement the
reference count and we could end up with a race condition here. This MAY
result in unpredictable behavior, unless we implement the reference counting
in a thread safe manner. This is especially true on SMP machines. I may have
this all wrong and this may be thread-safe, in which case I apologize.
 
Simply by implementing thread safe increment/decrement, we can guarantee
that as long as NO CHANGES are made to the document itself, multiple threads
can be reading various parts of it. This is because on each object there
will be at least one handle holding a reference count (i.e. the main
document itself), hence no objects will need to be added to the allocator's
free list. Note that if we end up deleting something, and this has to be
added to the document's free list then the allocator needs to be
thread-safe. Note also that making the per document allocator thread safe
may not be too bad as there will rarely be contention for this allocator. We
can use a non-yielding spin latch (STLport does one) which would mean very
little overhead for having a truly thread-safe (read-only) DOM model. Note
that in many cases the spin latch is highly optimized by relying on OS
specific interfaces for atomic operations (Win32 InterlockedXXXX functions),
or in some cases hand optimizing assembly to implement atomic operations
(STLport does this on Solaris).
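
To make the increment/decrement point concrete, here is a minimal sketch of a handle whose reference count is updated atomically. It is not the Xerces-C++ code: NodeBody and NodeHandle are hypothetical names, and std::atomic (available in later C++ standards) stands in for the OS-specific primitives mentioned above.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical body type; not a Xerces-C++ class.
class NodeBody {
public:
    NodeBody() : refCount_(0) {}
    void addRef() { refCount_.fetch_add(1, std::memory_order_relaxed); }
    // Returns true when the caller dropped the last reference.
    bool release() {
        return refCount_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
    std::size_t refCount() const {
        return refCount_.load(std::memory_order_acquire);
    }
private:
    std::atomic<std::size_t> refCount_;
};

// Hypothetical handle: copying and destroying only touch the atomic counter,
// so two threads may copy or destroy handles to the same body concurrently.
class NodeHandle {
public:
    explicit NodeHandle(NodeBody* body) : body_(body) {
        if (body_) body_->addRef();
    }
    NodeHandle(const NodeHandle& other) : body_(other.body_) {
        if (body_) body_->addRef();
    }
    NodeHandle& operator=(const NodeHandle& other) {
        NodeBody* incoming = other.body_;
        if (incoming) incoming->addRef();  // add first: safe for self-assignment
        drop();
        body_ = incoming;
        return *this;
    }
    ~NodeHandle() { drop(); }
    bool isNull() const { return body_ == nullptr; }
private:
    void drop() {
        if (body_ && body_->release()) delete body_;
        body_ = nullptr;
    }
    NodeBody* body_;
};
```

As long as some long-lived handle (the document, in the argument above) keeps the count at one or more, reader threads never reach the delete path. The sketch only protects the counter itself; assigning to the same handle object from two threads at once would still need external synchronization.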
 
This is one thing that the existing IDOM has going for it, i.e. it is
'read-only' thread safe.
 
More of my two bits...
 
Samar Lotia

-----Original Message-----
From: Lenny Hoffman [mailto:lennyhoffman@earthlink.net]
Sent: Monday, April 29, 2002 21:47
To: xerces-c-dev@xml.apache.org
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


I just had a new thought; if having a DOMString class is desired, for
functionality and/or DOM compliance, then the smart pointer approach can
still be used by updating the IDOM classes to return DOMString instances
instead of XMLCh*.  With smart pointers we would still have only one set of
interfaces to maintain, and performance would be negligibly affected: as I
pointed out earlier, I modified DOMString to simply wrap an alias to the
node-owned XMLCh* data, and it only makes a copy if modified.
 
Lenny

-----Original Message-----
From: Lenny Hoffman [mailto:lennyhoffman@earthlink.net]
Sent: Monday, April 29, 2002 9:37 PM
To: xerces-c-dev@xml.apache.org
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


Hi Samar,
 
You make good points.  
 
I would agree that it is reasonable to nix the DOMString, but does anyone
object to that given that DOMString is explicitly specified in the W3C DOM
specification?  Judging so far from the early responders to the vote, no, as
folks voting for the IDOM interface are also voting to nix the DOMString
class. 
 
Tinny, do you anticipate that the W3C will complain if the C++ binding does
not have a DOMString?  In other words, will we be able to call ourselves DOMx
compliant without it?
 
One more consequence of using the smart pointer approach is that backwards
compatibility with the original DOM interfaces is sacrificed for backwards
compatibility with the IDOM interfaces.  I thought that with the original
DOM interfaces being officially supported and around longer that backwards
compatibility to it would be more important, but so far no one using the
original DOM interface has spoken up.  For my use cases it simply doesn't
matter, what matters most to me is functional behavior and ease of use.
 
Just to make it easier to review, here is the earlier example following your
suggestion to avoid using an int operator on node for null comparison:
 

if (!pm_Element.isNull()) 
    pm_Element->getAttribute(...);
 
Lenny
 

-----Original Message-----
From: Samar Lotia [mailto:slotia@siebel.com]
Sent: Monday, April 29, 2002 7:59 PM
To: 'xerces-c-dev@xml.apache.org'
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


If the desire is to maintain only one interface, then I would be of the
opinion that we should nix the DOMString class and use a 'smart pointer'
class to wrap the internal interfaces. In many cases, people will likely
have their own preferred string class which they use and will immediately
convert the value extracted from the DOM before passing into any other layer
of their code.
 
If we keep DOMString around, I would recommend against having a (const XMLCh
*) operator as this can result in some incredibly hard to track errors. Most
C++ style guides recommend against implicit conversion operators. Note the
lack of such an operator in the C++ standard library string, i.e.
std::basic_string<T>. Having something like rawBuffer, or XMLCh() would be
clearer and lets one control lifetimes in a much clearer way (IMHO).
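
A small sketch of the lifetime problem I have in mind. OwnedString, makeName and nameLength are hypothetical names, and char16_t stands in for XMLCh; none of this is the actual DOMString.

```cpp
#include <cstddef>
#include <string>

typedef char16_t Ch;  // stand-in for XMLCh (an assumption of this sketch)

// Hypothetical string class that owns its characters.
class OwnedString {
public:
    explicit OwnedString(const Ch* text) : data_(text) {}

    // Explicit accessor: the call site shows that a raw pointer is taken.
    const Ch* rawBuffer() const { return data_.c_str(); }

    // Deliberately no "operator const Ch*()". With it, code such as
    //     const Ch* p = makeName();   // temporary dies at the ';'
    // would compile silently and leave p dangling.
private:
    std::basic_string<Ch> data_;
};

// Hypothetical factory returning a temporary.
inline OwnedString makeName() { return OwnedString(u"element"); }

// Safe pattern: keep the object alive for as long as the pointer is used.
inline std::size_t nameLength() {
    OwnedString name = makeName();
    const Ch* p = name.rawBuffer();
    return std::char_traits<Ch>::length(p);
}
```

With the named accessor, the dangling case has to be written as makeName().rawBuffer(), which is at least visible in review.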
 
Also, I would recommend against adding an int operator on the smart pointer
class. It is not that much work to call isNull on the object, and is much
clearer from a readability perspective as well as helps catch silly errors
at compile time. If we must have such an operator then it may be better to
add a bool operator instead of int, as this will likely reduce the number of
places where the implicit conversion operator will be called.
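
For illustration, a sketch of the two options on a hypothetical handle class (Handle and Body are made-up names). The explicit conversion operator used here is a later C++ feature, so treat it as one possible way to get the "bool, not int" behaviour rather than something the current compilers all support.

```cpp
// Hypothetical body; only one field so the example has something to read.
struct Body { int attributeCount; };

// Hypothetical smart pointer handle.
class Handle {
public:
    explicit Handle(Body* body = nullptr) : body_(body) {}

    bool isNull() const { return body_ == nullptr; }

    // Explicit conversion: usable in if/while and with '!', but it does not
    // take part in arithmetic or comparisons with integers.
    explicit operator bool() const { return body_ != nullptr; }

    Body* operator->() const { return body_; }
private:
    Body* body_;
};

inline int attributeCountOrZero(const Handle& h) {
    if (!h.isNull())            // the spelling recommended above
        return h->attributeCount;
    return 0;
}

// With "operator int()" instead, expressions such as "h + 1" or "h == 1"
// would compile; with the explicit operator bool they are compile errors.
```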
 
My two bits...
 
Samar Lotia

-----Original Message-----
From: Lenny Hoffman [mailto:lennyhoffman@earthlink.net]
Sent: Monday, April 29, 2002 19:38
To: xerces-c-dev@xml.apache.org
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


Hi Markus,
 
Thank you very much for the insight.
 
Note that simply accessing the IDOM implementation via handles does not
affect its thread safety, thus your application is safe.
 

if (pm_Element) 
    pm_Element->getAttribute(...);
 
How can I do this with references? 
 
You do it with the current handles like this:
 
if (!pm_Element.isNull()) 
    pm_Element.getAttribute(...);
 
Adding an int operator to DOM_Node would allow even more friendly syntax;
e.g.
 

if (pm_Element) 
    pm_Element.getAttribute(...);
 
This could be easily added.
 
In fact, an -> operator could be added to the DOM_Node classes to get
this:
 
if (pm_Element) 
    pm_Element->getAttribute(...);
 
This is now exactly what you started out with, thus is completely backward
compatible with your current use of the IDOM.
 
 

XMLCh* are easier to handle than DOMString objects in ATL:  CComBSTR cBstr =
pm_Element->getAttribute(...);
 
Good point, the current DOMString class does not have an XMLCh* operator,
which if it did would solve your problem.  I pretty much gutted the original
DOMString class to make it a simple wrapper around an XMLCh* returned from
IDOM implementations, in lieu of suffering the costs of the cross-document
string management of the original DOM.  As far as I can tell the only reason
the original DOMString did not have an XMLCh* operator was because there was
no guarantee that its internal XMLCh* was null terminated; well, that
guarantee does now exist and the operator can be added -- I will do that.
So your example remains:
 

CComBSTR cBstr = pm_Element->getAttribute(...);
 
Note that string classes are a convenient way to perform various operations on
a string without using the static (read functional) methods provided by
XMLString.  I even implemented COW (copy on write) behavior in the new
DOMString class, so that you can feel free to modify a string returned from
a node without having to manually make a copy.
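
Roughly, the copy-on-write behavior can be sketched as follows. This is a simplified illustration, not the patched DOMString: AliasString and its members are hypothetical names, char16_t stands in for XMLCh, and the borrowed pointer is assumed to be non-null and null-terminated.

```cpp
#include <cstddef>
#include <string>

typedef char16_t Ch;  // stand-in for XMLCh (an assumption of this sketch)

// Hypothetical wrapper: aliases a buffer owned by a node until first write.
class AliasString {
public:
    explicit AliasString(const Ch* borrowed)
        : borrowed_(borrowed), owned_(), isOwned_(false) {}

    // Read access never copies.
    const Ch* rawBuffer() const {
        return isOwned_ ? owned_.c_str() : borrowed_;
    }

    std::size_t length() const {
        return isOwned_ ? owned_.size()
                        : std::char_traits<Ch>::length(borrowed_);
    }

    // The first write makes a private copy; the node's buffer is not touched.
    void append(const Ch* more) {
        detach();
        owned_.append(more);
    }

    bool ownsCopy() const { return isOwned_; }

private:
    void detach() {
        if (!isOwned_) {
            owned_.assign(borrowed_);
            isOwned_ = true;
        }
    }

    const Ch* borrowed_;            // owned by the node, not by this object
    std::basic_string<Ch> owned_;   // filled only after the first modification
    bool isOwned_;
};
```

Note that an unmodified AliasString is only valid while the node-owned buffer it aliases stays alive and unchanged.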
 
If folks don't find the DOMString wrapper to be that important, that frees
me up to simplify the handle classes and address one of Tinny's concerns.
Tinny pointed out that while the new design hides dual interfaces (DOM and
IDOM) from users, it does not hide them from DOM developers;  as DOM 3
support is added, each interface change would have to be made to both DOM
and IDOM classes.  The only reason I went with complete interface
replication instead of simple smart pointers for the handle classes was to
be able to translate XMLCh pointers returned from IDOM nodes into
DOMStrings.  If I am allowed to get rid of DOMString altogether I can make
the handle classes simple smart pointers that do not replicate IDOM
interfaces, and thus the duplication of effort is eliminated.  
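
As a rough sketch of what "simple smart pointers" could look like: one template that forwards through operator->, so nothing from the body interface has to be repeated in the handle. BodyNode, BodyElement and NodePtr are hypothetical stand-ins, not the IDOM classes, and the counter here is not thread-safe.

```cpp
#include <cstddef>

// Hypothetical stand-in for an IDOM node body; only the shape matters.
class BodyNode {
public:
    BodyNode() : refs_(0) {}
    virtual ~BodyNode() {}
    void addRef() { ++refs_; }
    bool dropRef() { return --refs_ == 0; }  // true when the last ref is gone
    std::size_t refs() const { return refs_; }
private:
    std::size_t refs_;  // plain counter: not thread-safe in this sketch
};

class BodyElement : public BodyNode {
public:
    const char* getTagName() const { return "item"; }
};

// One template replaces a hand-written handle class per interface.
template <typename T>
class NodePtr {
public:
    explicit NodePtr(T* body = nullptr) : body_(body) {
        if (body_) body_->addRef();
    }
    NodePtr(const NodePtr& other) : body_(other.body_) {
        if (body_) body_->addRef();
    }
    NodePtr& operator=(NodePtr other) {  // copy-and-swap
        swap(other);
        return *this;
    }
    ~NodePtr() {
        if (body_ && body_->dropRef()) delete body_;
    }
    bool isNull() const { return body_ == nullptr; }
    T* operator->() const { return body_; }  // every body method is reachable
private:
    void swap(NodePtr& other) {
        T* tmp = body_;
        body_ = other.body_;
        other.body_ = tmp;
    }
    T* body_;
};
```

A method added to the body class for DOM 3 would then be callable through the handle with no change to NodePtr.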
 
Lenny
 
 -----Original Message-----
From: Markus Fellner [mailto:fellner@gimbio.de]
Sent: Monday, April 29, 2002 6:17 PM
To: xerces-c-dev@xml.apache.org; lenny.hoffman@objectivity.com
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface



OK, the main reason for my IDOM flirtation is...
I chose IDOM because of its thread safety. And now I have several
thousand lines of code using the IDOM interface.
 
Some other reasons are...
I have many IDOM_Element* members (pm_Elem) in my classes. After parsing,
they are assigned once and then checked many times to see whether they are
really assigned before being used for reading and writing attributes.
 
if (pm_Element) 
    pm_Element->getAttribute(...);
 
How can I do this with references? 
 
XMLCh* are easier to handle than DOMString objects in ATL:  CComBSTR cBstr =
pm_Element->getAttribute(...);
...
 
Sorry for my short answer. I go on holiday tomorrow and I have to pack!
 
I'm back in 2 weeks and look forward to seeing the results of this vote.
It's a pity to leave during a hot discussion on this list.
 
Markus

-----Original Message-----
From: Lenny Hoffman [mailto:lennyhoffman@earthlink.net]
Sent: Monday, April 29, 2002 23:54
To: mf@gimbio.de; xerces-c-dev@xml.apache.org; lenny.hoffman@objectivity.com
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


Hi Markus,
 
To be clear, the fix I created for the IDOM was to recycle memory once a
node or string is no longer needed.   To know when a node is no longer
needed I used the original DOM interface classes as handles, with the IDOM
as the implementation behind them.  IDOM performance is maintained, but ease
of use is greatly improved.  Without the DOM handles to track whether an IDOM
node is in use, application code would have to state explicitly when a node
is no longer needed and can be recycled, which is yet another thing to be
documented and for application developers to get wrong and suffer the
consequences of.
 
If you love and use the IDOM for its performance, and you want the memory
problem really fixed rather than worked around in a way that only works if
your application does everything right, then you will love what I have done
by combining DOM classes as handles and IDOM classes as bodies.
 
If what you love is working with pointers instead of with objects, please
let me know why.  
 
One thing I have found harder with objects vs. pointers is down casting
from node to derived objects like element.  The syntax is a bit cleaner with
pointers; e.g.:
 
    DOM_Node node = ...
    DOM_Element elem =  (const DOM_Element&)node;
 
vs:
 
    IDOM_Node* node = ..
    IDOM_Element* elem = (IDOM_Element*)node; 
 
It is easy to forget to add the const in the first case, and it is somewhat
non-intuitive because slicing can happen, though it is not a problem in this
case.
 
To solve this problem I have thought of adding overloaded constructors and
assignment operators that take a DOM_Node to DOM_Node derived classes like
DOM_Element.  Thus the first example becomes:
 

    DOM_Node node = ...
    DOM_Element elem =  node;
 
Not only is this code more succinct, but it is safer, as the overloaded
constructor and assignment operator can check for node compatibility via the
getNodeType call.
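
A sketch of how such a checked conversion might look. NodeRef, ElementRef and NodeImpl are hypothetical, simplified stand-ins for DOM_Node, DOM_Element and their implementation; returning a null handle on a type mismatch is an assumption of this sketch, and throwing would be another option.

```cpp
// Node type codes as defined by the W3C DOM (only two shown).
enum NodeType { ELEMENT_NODE = 1, TEXT_NODE = 3 };

// Hypothetical implementation object.
struct NodeImpl { NodeType type; };

// Hypothetical base handle.
class NodeRef {
public:
    explicit NodeRef(NodeImpl* impl = nullptr) : impl_(impl) {}
    bool isNull() const { return impl_ == nullptr; }
    NodeType getNodeType() const { return impl_->type; }  // requires !isNull()
protected:
    NodeImpl* impl_;
};

// Hypothetical derived handle with a checked conversion from NodeRef.
class ElementRef : public NodeRef {
public:
    ElementRef() {}

    // Converting constructor: lets "ElementRef elem = node;" compile.
    ElementRef(const NodeRef& node) : NodeRef(node) { check(); }

    ElementRef& operator=(const NodeRef& node) {
        NodeRef::operator=(node);
        check();
        return *this;
    }
private:
    // Assumption: a non-element node yields a null handle.
    void check() {
        if (!isNull() && getNodeType() != ELEMENT_NODE) impl_ = nullptr;
    }
};
```

The caller then tests elem.isNull() after the conversion instead of trusting an unchecked cast.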
 
Again, please let me know what other aspects of pointers make things easier
for you.
 
> Hope your fix has no effects on thread-safe-ness!
 
No effect whatsoever.
 
Lenny

-----Original Message-----
From: Markus Fellner [mailto:fellner@gimbio.de]
Sent: Monday, April 29, 2002 4:15 PM
To: xerces-c-dev@xml.apache.org; lenny.hoffman@objectivity.com
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


Hi Lenny,
 
I hope your fix of the IDOM memory problem goes into the next official
release. But I use and love the IDOM interface.
It's really easier for an old C++ programmer like me! And I use IDOM cause
of its threadsafe properties. Hope your fix has no effects on
thread-safe-ness!
 
Markus
 

-----Original Message-----
From: Lenny Hoffman [mailto:lennyhoffman@earthlink.net]
Sent: Monday, April 29, 2002 17:57
To: xerces-c-dev@xml.apache.org; mf@gimbio.de
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


Hi Markus,
 
The memory management problem is solved by recycling nodes and strings that
are no longer used.  The only clean way I know of to tell when nodes and
strings are being used is the handle/body pattern, which is what the original
DOM uses.  What I have done is use the original DOM handles and the IDOM
implementation, but fixed the IDOM memory problem.
 
Lenny

-----Original Message-----
From: Markus Fellner [mailto:fellner@gimbio.de]
Sent: Monday, April 29, 2002 10:54 AM
To: xerces-c-dev@xml.apache.org
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported
W3C DOM interface


If the memory management problem is solved, I prefer IDOM!!!

-----Original Message-----
From: Tinny Ng [mailto:tng-xml@ca.ibm.com]
Sent: Monday, April 29, 2002 17:08
To: xerces-c-dev@xml.apache.org
Subject: Call for Vote: which one to be the Xerces-C++ public supported W3C
DOM interface



Hi everyone,
 
I've reviewed Andy's design objective of IDOM, Lenny's view of old DOM and
his redesign proposal, and some users' feedback.   Here is a "quick"
summary and I would like to call for a VOTE about the fate of these two
interfaces.
 
1.0 Objective
==========
1.  Define the strategy of Xerces-C++ public DOM interface.  Decide which
one to keep, old DOM interface or new IDOM interface
 
 
2.0 Motivation
===========
1. As a long term strategy, Xerces-C++ shouldn't define two W3C DOM
interfaces, which simply confuses users.
    => We've already had many user questions about what the difference is,
which one to use ... etc.
2. With limited resources, we should focus our development on ONE stream,
with no more duplicated effort.
    => New DOM Level 3 development should be done on one interface, not
both.
    => No more dual maintenance: two sets of samples (e.g. DOMPrint vs
IDOMPrint), two parsers (DOMParser vs IDOMParser)
3. To better place Apache Xerces-C++ in the market, we should have our
Apache Recommended DOM C++ Binding in http://www.w3.org/DOM/Bindings
    => To encourage more users to develop DOM application AND implementation
based on this binding.
    => Such a binding should just define a set of abstract base classes
(similar to Java interfaces) where no implementation model is assumed
 
 
3.0 History
=========
'DOM' was the initial "W3C DOM interface" developed by Xerces-C++.  However
the performance of its implementation is not quite satisfactory.

Last year, Andy Heninger came up with a new design with faster performance,
and that implementation came with a new set of interfaces => 'IDOM'.
 
Currently both 'DOM' and 'IDOM' are shipped with Xerces-C++.  'IDOM' is
described as experimental (like a prototype) and is subject to change.

More information can be found in:
http://xml.apache.org/xerces-c/program.html
http://www.apache.org/~andyh/
http://marc.theaimsgroup.com/?t=101650188300002&r=1&w=2
http://marc.theaimsgroup.com/?w=2&r=1&s=Proposal%3A+C%2B%2B+Language+Binding+for+DOM+L&q=t
 
 
4.0 IDOM
=========
4.1 Interface
==========
 
4.1.1 Features of IDOM Interface
--------------------------------------------------
e.g. virtual IDOM_Element* IDOM_Document::createElement(const XMLCh*
tagName) = 0;
 


1. Defined as abstract base classes
2. Use normal C++ pointers.
    => So that abstract base class is possible.
    => Make it more C++ like. Less Java like.
 
 
4.1.2 Pros and Cons of IDOM Interface
----------------------------------------------------------
Pros:
1. Abstract base classes that correspond to the W3C DOM interfaces
    => Can be recommended as Apache DOM C++ Binding
    => More standard like, no implementation assumed as they are just
abstract interfaces using pure virtual functions
2. (Depends on users' preference)
    - some users prefer a C++-like style
 
Cons:
1. IDOM_XXX - weird prefix 'I'
    Solution:
        - Proposed to rename to DOMXXXX which also matches the DOM Level 3
naming convention
2. (Depends on users' preference)
    - some users do not like pointers and want a Java-like interface for ease
of use, ease of learning and ease of porting (from Java).
3. As the old DOM interface has been around for a long time, the majority of
current Xerces-C++ users still use it, so the migration impact is
significant
    Solution:
        - Announce the deprecation of old DOM interface for a couple of
releases before removal
    

4.2 Implementation
===============

4.2.1 Features of IDOM Implementation
-----------------------------------------------------------
1. Use an independent storage allocator per document. The advantage here is
that allocation would require no synchronization 
    => Fast, good scalability, reduced memory footprint
2. Use plain, null-terminated (XMLCh *) utf-16 strings. 
    => No DOMString class overhead which is another performance contributor
that makes IDOM faster
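
The per-document allocator idea in point 1 can be sketched as a simple bump allocator. This is an illustration only, not the IDOM allocator: DocumentArena is a hypothetical name, and the block size and alignment choice are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-document bump allocator.
class DocumentArena {
public:
    explicit DocumentArena(std::size_t blockSize = 4096)
        : blockSize_(blockSize), capacity_(0), used_(0) {}

    DocumentArena(const DocumentArena&) = delete;
    DocumentArena& operator=(const DocumentArena&) = delete;

    // Everything is returned to the system at once, with the document.
    ~DocumentArena() {
        for (std::size_t i = 0; i < blocks_.size(); ++i) delete[] blocks_[i];
    }

    // No lock is taken: the arena belongs to one document, and that document
    // is assumed to be modified by one thread at a time.
    void* allocate(std::size_t bytes) {
        const std::size_t align = alignof(std::max_align_t);
        bytes = (bytes + align - 1) / align * align;  // round up
        if (blocks_.empty() || used_ + bytes > capacity_) {
            capacity_ = bytes > blockSize_ ? bytes : blockSize_;
            blocks_.push_back(new char[capacity_]);
            used_ = 0;
        }
        void* p = blocks_.back() + used_;
        used_ += bytes;
        return p;
    }

    // There is intentionally no per-object free(): memory is only ever
    // handed out, never returned, until the arena is destroyed.
    std::size_t blockCount() const { return blocks_.size(); }

private:
    std::size_t blockSize_;
    std::size_t capacity_;       // size of the current (last) block
    std::size_t used_;           // bytes handed out from the current block
    std::vector<char*> blocks_;
};
```

Allocation is a pointer bump with no synchronization, which is where the speed comes from. The missing free() is also why a document that is modified repeatedly keeps growing until it is deleted.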
 
 
4.2.2 Downside of IDOM Implementation
-------------------------------------------------------------
1. Manual memory management 
    - If the document comes from the parser, then the parser owns it.  If
the document comes from DOMImplementation, then users are responsible for
deleting it.
    Solution:
        - Provide a means of disassociating a document from the parser
        - Add a function "Node::release()", similar to the idea of
"Range::detach", which allows users to indicate the release of the Node.  
            - From the C++ Binding abstract interface perspective, it is up
to the implementation how to handle this "release()" function.
            - With Xerces-C++ IDOM implementation, the release() function
will delete the 'this' pointer if it is a document, else no-op.
2. Memory retained until the document is deleted.
    - If you change the value of an attribute or call removeNode many times,
the memory of the old value is not deallocated for reuse and the document
grows and grows
    Solution:
        - This in fact is a tradeoff for the fast performance offered by
independent storage allocator.  
        - There is no immediate good solution in place
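
The proposed "Node::release()" behavior from point 1 can be sketched like this. The class names are simplified stand-ins, not the IDOM classes, and the destroyed flag exists only so the effect can be observed.

```cpp
// Abstract interface: callers use release() and never delete directly.
class Node {
public:
    virtual void release() = 0;
protected:
    virtual ~Node() {}
};

// Hypothetical document: release() really frees the object (and, in a real
// implementation, everything allocated from its per-document storage).
class Document : public Node {
public:
    explicit Document(bool* destroyed) : destroyed_(destroyed) {}
    virtual void release() { delete this; }
protected:
    virtual ~Document() {
        if (destroyed_) *destroyed_ = true;
    }
private:
    bool* destroyed_;  // observation hook for this sketch only
};

// Hypothetical element: release() is a no-op because its storage is
// reclaimed together with the owning document.
class Element : public Node {
public:
    explicit Element(Document* owner) : owner_(owner) {}
    virtual void release() {}
    Document* getOwnerDocument() const { return owner_; }
private:
    Document* owner_;
};
```

Under this scheme calling release() on an element is harmless but reclaims nothing, which is exactly the retained-memory downside described in point 2.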
 
 


5.0 old DOM
==========
5.1 Interface
========== 
 
5.1.1 Features of old DOM Interface
-----------------------------------------------------
e.g. DOM_Element DOM_Document::createElement(const DOMString tagName);
 
1. Use smart pointers - Java-like
 
 
5.1.2 Pros and Cons of old DOM Interface
--------------------------------------------------------------
Pros:

1. DOM_XXX - reasonable name
2. (Depends on users' preference)
    - some users want a Java-like interface for ease of use, ease of
learning and ease of porting (from Java).
3. Not that many users have migrated to IDOM yet, so migration impact is
minimal.
 
Cons:
1. Not abstract base class
    - Cannot be recommended as Apache DOM C++ Binding
    - Implementation (smart pointer indirection) is assumed
    Solution:
        - This in fact is a tradeoff for the ease of use of smart pointer
design
        - No solution.

2. (Depends on users' preference)
    - some users want a C++-like style, as this is a C++ interface
 
    

5.2 Implementation
===============
5.2.1 Features of old DOM Implementation
----------------------------------------------------------------
1. Automatic memory management
    - Memory is released when there are no more handles pointing to it
    - Uses reference counts to keep track of handles
2. Use thread-safe DOMString class
 
 
5.2.2 Downside of old DOM Implementation
--------------------------------------------------------------------
1. Performance is slow
    - Memory management is the biggest time consumer, and the memory
footprint is large.
    - There are a whole lot of blocks allocated when creating a document and
then freed when finished with it. Each and every node requires at least one
and sometimes several separately allocated blocks. A DOMString takes three.
It adds up.
    Solution:
        - Lenny suggests using the IDOM interface internally in the DOM
implementation; see the patch in Bugzilla 5967
        - The performance benefits of IDOM are then gained, but the memory
retention problem in the IDOM implementation still remains to be addressed.
        - And internally, we will have a dual interface maintenance model, as
the IDOM interface is then used by DOM internally.
 
 

Vote Question:
============
I would like to call for a vote:
 
    ==>  Which INTERFACE should be the Xerces-C++ public supported W3C DOM
Interface, DOM or IDOM? <===
 
Note:  
1. The question is asking which "interface" is to be officially supported.
Once the interface is chosen, we can discuss how to solve the
downsides of its implementation as the next topic.
2. The one voted for will become the ONLY Xerces-C++ supported public W3C
DOM Interface, and is where DOM Level 3 will be implemented.
3. The API of the other interface will be deprecated, and its samples and
associated parser will eventually be removed from the distribution.