You are viewing a plain text version of this content. The canonical link for it is here.
Posted to j-dev@xerces.apache.org by Ryan Schmidt <ry...@cpsc.ucalgary.ca> on 2000/06/15 01:49:31 UTC

[Fix] DOMParser > handling bug

[Note: This fixes the problem in Xerces v1.1.1. I'm assuming the same
 problem exists in the CVS tree, but I'm not sure how to check. It's
 kind of long because I don't really know if the changes I suggest
 are going to affect anything else, so someone who knows more then
 me should give it a thorough check]

This is the fix for the problem where, with the XML string:

	<List><Skills>Language -&gt; Java; Tool -&gt;&quot;Visual
Age&quot;</Skills></List>

a getNodeValue() call on the first Text node inside Skills Element 
would return 'Languange -' instead of the proper string.

The bug only exists in DeferredDocumentImpl. DOMParser recognizes each
of
the escaped characters as an internal entity(?) so they each get their
own
characters() call. If you turn off deferred-expansion, DOMParser will 
concatenate all the substrings together into one Text node. If not, it
adds each piece of the string seperately and lets the expansion take
care of normalizing. (I personally think this is kind of odd, there is
still more then one text node...)

But normalizing doesn't work because getNodeType(int) of
DeferredDocumentImpl
_always_ returns -1. In fact, all the getXXX(int) methods of 
DeferredDocumentImpl return -1. Even worse, they destroy the fNodeXXX
'chunks' that store the expansion information. Why? Well....

All these methods call an overloaded version of themselves that takes a 
boolean that lets these other methods know whether or not to destroy
the 'chunk' information. These booleans are all passed as true
when I'm pretty sure they should be false. Here is one of the methods:


  public short getNodeType(int nodeIndex) {
      return getNodeType(nodeIndex, true);
  }


And here is the overloaded version:


  public short getNodeType(int nodeIndex, boolean free) {
      if (nodeIndex == -1) {
          return -1;
      }

      int chunk = nodeIndex >> CHUNK_SHIFT;
      int index = nodeIndex & CHUNK_MASK;
      if (free) {
          return (short)clearChunkIndex(fNodeType, chunk, index);
      }
      return (short)getChunkIndex(fNodeType, chunk, index);

  } // getNodeType(int):int


So the getXXX(int) methods always destroys the chunk information it
references and returns -1. I'm under the impression that this
is a Bad Thing. If you go through and change all those true's to
false's, 
the escaped strings work. 

This should be done for all these methods:
  getNodeName(int)
  getNodeNameString(int)
  getNodeType(int)
  getNodeURI(int)
  getNodeValue(int)
  getNodeValueString(int)
  getParentNode(int)
  getPrevSibling(int)
  getRealPrevSibling(int)

Note that I have no idea what the implications of these changes are
beyond the
simple tests I did (to make sure all the quoted characters were handled
properly).

Sorry, but I have no idea how to make a Patch, and I'm not using the
most current
CVS version anyway. Hope this helps.

-RMS

Re: [Fix] DOMParser > handling bug

Posted by Ryan Schmidt <ry...@cpsc.ucalgary.ca>.
Ack. I forgot something. This change is also necessary
in DeferredDocumentImpl.getNodeObject(int):

replace the line:
	int type = clearChunkIndex(fNodeType, chunk, index);
with:
	int type = getChunkIndex(fNodeType, chunk, index);

Ok, that's it..

-RMS

Re: [Fix] DOMParser > handling bug

Posted by Ryan Schmidt <ry...@cpsc.ucalgary.ca>.
> The reason why they are there is in order to save memory.
> The chunks do some "reference counting" so that when they are
> accessed, the large chunks of integers can be garbage collected.

Ok. Now this all makes sense. Some of the data in the chunks is used 
in synchronizeChildren and discarded...

> I'm guessing that it's just a
> matter of fixing the DeferredTextImpl's synchronizeData method.

Absolutely. The problem was that the syncronizeData method was
trying to get the chunk information for it's parent and sibling
nodes that was discarded by synchronizeChildren. So that data
was available in the actual nodes instead. Here is a 
solution (maybe):

----DeferredTextImpl.synchronizeData()-----


	/** Synchronizes the underlying data. */
	protected void synchronizeData() {

		// no need for future synchronizations
		syncData(false);

		// get initial text value
		DeferredDocumentImpl ownerDocument =
			(DeferredDocumentImpl) this.ownerDocument();
		data = ownerDocument.getNodeValueString(fNodeIndex);

			
		// append next node's text value to this one
		Node parent = this.getParentNode();
		if (getNodeType() == Node.TEXT_NODE && parent != null &&
				parent.getNodeType() == Node.ELEMENT_NODE) {

			Node nextNode = this.getNextSibling();
			if (nextNode != null && nextNode.getNodeType() == Node.TEXT_NODE) {
				StringBuffer sb = new StringBuffer(data);
				sb.insert(sb.length(), nextNode.getNodeValue());
				data = sb.toString();

				//clean up the tree
				this.nextSibling = null;
				this.previousSibling = null;
			}
		}

		// ignorable whitespace
		ignorableWhitespace(ownerDocument.getLastChild(fNodeIndex) == 1);

	} // synchronizeData()

---end---


This works for my simple tests. The only thing I'm not really sure about
is the two lines that are commented 'clean up the tree'. If you remove
these lines, getNodeValue() still works. But XMLSerializer is broken
(along with, most likely, a _LOT_ of user code), because there is
actually more than one Text node. You get the full text with
getNodeValue()
on the first one, then the last n-1 Texts with getNodeValue() on the 
second, etc. The fix for this is simple, eg here is one for
XMLSerializer:

---XMLSerializer.serializeElement() (near the bottom)---


		while ( child != null ) {
			serializeNode( child );
			if(child.getNodeType() == Node.TEXT_NODE)
				child = null;
			else
				child = child.getNextSibling();
		}
---end---


...but I'd suggest not doing this because of all the user code it will
break. Mine
is broken with the second solution, and I'm probably not the only one.
Enjoy =)

-RMS

Re: [Fix] DOMParser > handling bug

Posted by Andy Clark <an...@apache.org>.
Ryan Schmidt wrote:
> All these methods call an overloaded version of themselves that takes a
> boolean that lets these other methods know whether or not to destroy
> the 'chunk' information. These booleans are all passed as true
> when I'm pretty sure they should be false. Here is one of the methods:

Wow, I'm amazed at how quickly you came to understand the inner
workings of the deferred DOM implementation! It's a strange
and mysterious beast...

Anyway, I don't know if it's a good idea to just change all of the
calls to clear the chunk information so that they don't clear it
anymore. The reason why they are there is in order to save memory.
The chunks do some "reference counting" so that when they are
accessed, the large chunks of integers can be garbage collected.
By changing all of the call sites, then you don't have the ability
to collect those array chunks.

I would prefer to know the exact place where this is occuring and
then change that code to not obliterate the node type information
before it is absolutely necessary. I'm guessing that it's just a
matter of fixing the DeferredTextImpl's synchronizeData method.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org