You are viewing a plain text version of this content. The canonical link for it is here.
Posted to j-dev@xerces.apache.org by Jay Cain <ja...@cett.msstate.edu> on 2001/01/19 01:59:24 UTC

[PATCH] Optimized normalization methods

This patch modifies AttributeMap.java, AttrImpl.java, ElementImpl.java, and
ParentNode.java to optimize normalization. This is done by performing a
check when the set of a parent node's children have changed (inserted,
removed). If the change causes the parent to be unnormalized, it is flagged
as so, and its ancestors are inherently flagged as unnormalized.
Effectively, a call to the normalize method normalizes only the nodes that
need it, rather than walking through the entire subtree doing a brute-force
check.

This optimization is most effective on documents that are several levels
deep or for applications that blindly and frequently call normalize (such as
Cocoon). I did some tests using Cocoon and achieved an average performance
increase of 10% with a best-case scenario yielding close to 20%. Additional
memory overhead involves a boolean value added to AttrImpl and ParentNode,
which seems fairly trivial.

- - - - -
Jay Cain
Center for Educational and Training Technology
Mississippi State University

Re: [PATCH] Optimized normalization methods

Posted by Arnaud Le Hors <le...@us.ibm.com>.
I hope I didn't give the wrong impression. I actually think this is an
interesting patch that is definitely worth considering. 10 to 20% of
performance improvement is clearly significant!
I'm planning to apply the patch with the few modifications I mentioned
in my previous message. I'll also make some performance test to see how
it impacts building the DOM.
Stay tuned.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

Re: [PATCH] Optimized normalization methods

Posted by Arnaud Le Hors <le...@us.ibm.com>.
Jay,
I applied your patch and reviewed the code again. I run some tests that
showed a slight regression in performance while building the tree.
However, after some editing of your code, mostly getting rid of some
method calls and casts by accessing the structure directly, I was able
to get back to the same level of performance. So I commited the changes!
Thanks for your contribution.

Hey, one thing though, I found in your patch:

+	void checkNormalizationAfterInsert(Node insertedChild) {
+		if (insertedChild == null)
+			throw new NullPointerException();
+
+		// See if insertion caused this node to be unnormalized.
+		switch (insertedChild.getNodeType()) {

The test for null is useless and can only lead to slower code. Accessing
the variable in the switch condition will throw a NullPointerException
if insertedChild is null anyway...
Hope this helps.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

RE: [PATCH] Optimized normalization methods

Posted by Jay Cain <ja...@cett.msstate.edu>.
Here is an updated patch for optimizing normalization. The unnormalized flag
is now kept like all the other flags. This simplified things a bit and
consolidated some redundant code. Also, the default action for
checkNormalizationAfterInsert (in ParentNode and AttrImpl) is to check the
inserted child's normalized state.

Re: [PATCH] Optimized normalization methods

Posted by Arnaud Le Hors <le...@us.ibm.com>.
Jay Cain wrote:
> 
> > We also need to propagate the flag when attaching an unnormalized
> > attribute to an element.
> 
> The modification to the AttributeNodeMap does this.

Indeed, I missed that.

> I got your second reply while I was writing this one. I'll work on the
> changes and repost the patch, unless you really want to do it. :)

Well, I'm busy with something else right now, so if you do it it will
happen sooner.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

RE: [PATCH] Optimized normalization methods

Posted by Jay Cain <ja...@cett.msstate.edu>.
Arnaud,

Thanks for your reply.

> Instead of adding a new boolean member we can easily add a new bit flag.
> I converted long ago all the booleans into a single short
> (NodeImpl.flags) I use to store as bit flags all the booleans we need.

Whoops! Sorry about that. I'll change it.

> Assuming the parser builds a normalized document the only
> overhead would be the check. No propagation would occur because nothing
> would be unnormalized. That should be fine.

The conditions it checks prevent parsers from flagging it as unnormalized.
The only way it could be unnormalized is by modifying the set of children to
have two adjacent text nodes, so the parser (should) never meet this
condition, thus the flag is never propagated.

Also, when a node is already flagged as unnormalized, the propagation stops.
This prevents unnecessary redundancy.

> In checkNoramlizationAfterInsert on ParentNode, I would rather make the
> code more flexible and extend the case Node.ELEMENT_NODE to the default
> case. I'm worried that there might be other types of nodes that could be
> unnormalized and that we'd miss. EntityReference nodes come to mind.

I'll do that.

> We also need to propagate the flag when attaching an unnormalized
> attribute to an element.

The modification to the AttributeNodeMap does this.

I got your second reply while I was writing this one. I'll work on the
changes and repost the patch, unless you really want to do it. :)

- - - - -
Jay Cain
Center for Educational and Training Technology
Mississippi State University


Re: [PATCH] Optimized normalization methods

Posted by Arnaud Le Hors <le...@us.ibm.com>.
Hi Jay,
I just reviewed your patch and I have a couple of comments.

First, my guts reaction:

> Additional
> memory overhead involves a boolean value added to AttrImpl and ParentNode,
> which seems fairly trivial.

ARGHHHHHHHHHHHH!!!!! You're killing me!!! :-)
Seriously, I've spent many days working at reducing the size of the
nodes in memory, so it's NOT ok to just go ahead and start again on the
road of adding booleans and other things like that without carefully
considering it.

Now, let's move on to a more rationale reaction. ;-)

Instead of adding a new boolean member we can easily add a new bit flag.
I converted long ago all the booleans into a single short
(NodeImpl.flags) I use to store as bit flags all the booleans we need.
While the Java spec doesn't prevent a JVM to do just that internally,
the state of affairs at this point is that JVMs typically use a whole
int per boolean!

Now, looking further into your patch I see there is an impact on
performance while editing the tree. Simply because you have to make some
additional check and you have to propagate the unnormalized flag up the
tree when needed. Note that I'm not saying that's necessarily bad. I'm
just trying to assess the pros and cons. One thing I'm worried about for
instance is whether the initial build of the DOM from the parser will be
slower. Assuming the parser builds a normalized document the only
overhead would be the check. No propagation would occur because nothing
would be unnormalized. That should be fine.

Finally, a couple of details:

In checkNoramlizationAfterInsert on ParentNode, I would rather make the
code more flexible and extend the case Node.ELEMENT_NODE to the default
case. I'm worried that there might be other types of nodes that could be
unnormalized and that we'd miss. EntityReference nodes come to mind.
There aren't supposed to be affected but who knows...

We also need to propagate the flag when attaching an unnormalized
attribute to an element.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group