Posted to c-users@xerces.apache.org by Dor-Shifer Amit <Am...@comverse.com> on 2006/03/26 10:36:24 UTC

re-use of DOMDocument causing memory bloat?

Hello all,
 
In an application using xerces-c, I'm periodically printing an xml file
containing key=val data. val reflects application data and is constantly
changing (statistical data).

Frequently, val is set to '0'. In those cases, I consider it irrelevant
for printing, so I wish to remove it from the final xml file. 

The first solution for this was simply to recreate the DOM tree whenever
it is time to print. This imposed a heavy load (in terms of
performance), so we came up with an alternative: re-use as much of the
tree as possible. Each time we want to print it, we remove the redundant
'0' data nodes, re-add the previously removed nodes, and print. This
implementation yields gradually increasing memory consumption. I
traced it mainly to:

1. DOMElementImpl::getElementsByTagName (used to locate the correct
place to insert a node) 
2. DOMElementImpl::release (used when the node is removed, because the
val it contains ='0')

So my questions are:
a. Is this concept of partial re-use of an XML tree wrong? If it's ok,
then I'd expect the API not to hog memory.
b. Am I using the correct API for the task? If so, is there a way to
impose a clean-up of memory?
c. If the concept of re-use here is indeed wrong, what would you advise
as a solution for this issue, given that performance is important?

10x,
Amit

Re: re-use of DOMDocument causing memory bloat?

Posted by Axel Weiß <aw...@informatik.hu-berlin.de>.

Dor-Shifer Amit wrote:
> 1. DOMElementImpl::getElementsByTagName (used to locate the correct
> place to insert a node)
> 2. DOMElementImpl::release (used when the node is removed, because the
> val it contains ='0')

Hi Dor-Shifer,
two questions you should check first:

1. Do you delete the DOMNodeList you get as a result from 
DOMElementImpl::getElementsByTagName, after use?

2. Do you explicitly release the attributes of the removed nodes, before 
you release the nodes? You might be interested in reading e.g.  
http://blog.parthenoncomputing.com/xerces/archives/2005/05/memory_manageme.html

> So my questions are:
> a. Is this concept of partial re-use of an XML tree wrong? If it's ok,
> then I'd expect the API not to hog memory.

I'd say it's ok. But you have to take special care when you release nodes 
that have attributes.

> b. Am I using the correct API for the task? If so, is there a way to
> impose a clean-up of memory?

I'd rather not use DOMElement::getElementsByTagName; instead I'd traverse
the document and keep track of its structure. (This only means that I'm
not very familiar with the side effects that node removal might have
while traversing the DOMNodeList.) It is generally non-trivial to
manipulate lists while traversing them.
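
To illustrate why this is non-trivial: a DOMNodeList is "live" per the DOM
specification, so it shrinks as matching nodes are removed, and a forward
index loop can skip entries. The following is only a minimal sketch that
uses a plain std::vector as a stand-in for such a list (Entry and
remove_zero_entries are hypothetical names, not Xerces API). It walks from
the back so that a removal never shifts an entry that is still to be visited.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one key=val element; NOT a Xerces type.
struct Entry {
    std::string key;
    long val;
};

// Removes every entry whose val is 0 and returns how many were removed.
// Walking from the back means that erasing index i only shifts entries
// that were already visited, so no entry is skipped.
std::size_t remove_zero_entries(std::vector<Entry>& entries) {
    const std::size_t original_size = entries.size();
    std::size_t removed = 0;
    for (std::size_t i = entries.size(); i-- > 0; ) {
        if (entries[i].val == 0) {
            entries.erase(entries.begin() + static_cast<std::ptrdiff_t>(i));
            ++removed;
        }
    }
    // Sanity check: every entry was either kept or counted as removed.
    assert(entries.size() + removed == original_size);
    return removed;
}
```

With a forward loop that always increments the index, two adjacent '0'
entries would leave the second one in place, because it slides into the
slot that was just examined.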

> c. If the concept of re-use here is indeed wrong, what would you
> advise as a solution for this issue, given that performance is
> important?

I've just picked from my sources an example of how I usually manipulate
DOM trees. It is part of a 'pretty printer' that works in two steps:
the first removes all whitespace nodes, the second inserts pretty-printing
whitespace nodes. The whitespace removal step is an example of how I
manipulate a DOM tree while traversing it.

// is_text_node_whitespace_only() is a helper that is not shown here.
void remove_ws_nodes(DOMNode *node){
	if (node->getNodeType() == DOMNode::ELEMENT_NODE){
		// remove all leading ws-only nodes:
		for (DOMNode *child=node->getFirstChild(); child; child=node->getFirstChild()){
			if (!is_text_node_whitespace_only(child)) break;
			node->removeChild(child)->release();
		}
		// remove the remaining ws-only nodes and recurse into the other children:
		for (DOMNode *child=node->getFirstChild(); child; child=child->getNextSibling()){
			if (is_text_node_whitespace_only(child)){
				// since we have no leading ws-only nodes (just removed them all),
				// the predecessor of child exists:
				DOMNode *prev = child->getPreviousSibling();
				DOMNode *n = child;
				child = prev; // step back so getNextSibling() yields the node after n
				node->removeChild(n)->release();
			}
			// recursively remove ws-only nodes of all children:
			else remove_ws_nodes(child);
		}
	}
}

There are two loops. The first loop works only on the beginning of a node
list (via DOMNode::getFirstChild()), which causes no trouble at all. The
second loop then iterates over the list (by calling
DOMNode::getNextSibling()), which can lead to problems when
the current node is removed from the node list. In this case, the
current node is switched to its predecessor, in order to continue
traversing correctly after removal. Whitespace-only text nodes have no
attributes, so there is no danger of leaking when releasing them.

I prefer to make these non-trivial matters explicit, and I have never 
observed any memory leaks with this method.
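
For comparison, the same kind of traversal can also be written by reading
the next sibling before the current child is removed, which avoids the
separate loop for leading nodes. This is only a sketch on a toy node type:
Node, make_element, make_text, append_child, remove_and_free and
strip_ws_children are hypothetical names, not Xerces API, and the toy
simply deletes removed nodes instead of calling release().

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy stand-in for a DOM node; NOT a Xerces type.
struct Node {
    bool is_text;
    std::string text;   // only meaningful when is_text is true
    Node* first_child;
    Node* prev;
    Node* next;
};

Node* make_element() { return new Node{false, "", nullptr, nullptr, nullptr}; }
Node* make_text(const std::string& s) { return new Node{true, s, nullptr, nullptr, nullptr}; }

// Appends child at the end of parent's child list.
void append_child(Node* parent, Node* child) {
    assert(parent && child);
    child->prev = nullptr;
    child->next = nullptr;
    if (!parent->first_child) { parent->first_child = child; return; }
    Node* last = parent->first_child;
    while (last->next) last = last->next;
    last->next = child;
    child->prev = last;
}

// Frees a node together with its whole subtree.
void free_tree(Node* n) {
    for (Node* c = n->first_child; c; ) {
        Node* nx = c->next;   // read before c is deleted
        free_tree(c);
        c = nx;
    }
    delete n;
}

// Unlinks child from parent's child list and frees it.
void remove_and_free(Node* parent, Node* child) {
    if (child->prev) child->prev->next = child->next;
    else parent->first_child = child->next;
    if (child->next) child->next->prev = child->prev;
    free_tree(child);
}

bool is_ws_only_text(const Node* n) {
    if (!n->is_text) return false;
    for (char ch : n->text)
        if (!std::isspace(static_cast<unsigned char>(ch))) return false;
    return true;
}

std::size_t count_children(const Node* n) {
    std::size_t count = 0;
    for (const Node* c = n->first_child; c; c = c->next) ++count;
    return count;
}

// Removes whitespace-only text nodes below node, at every depth.
void strip_ws_children(Node* node) {
    for (Node* child = node->first_child; child; ) {
        Node* next = child->next;   // saved first: child may be freed below
        if (is_ws_only_text(child)) remove_and_free(node, child);
        else strip_ws_children(child);
        child = next;
    }
}
```

The design choice is the same in both variants: the pointer used to
continue the loop is never read from a node that has already been removed.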

HTH,
			Axel

Re: re-posting: re-use of DOMDocument causing memory bloat?

Posted by Laurent Oget <de...@oget.net>.

It sounds like the DOM tree somehow keeps traces of all nodes it ever held.
An ugly solution that does not involve digging into the guts of Xerces
would be to recreate the tree once every thousand prints.
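
This workaround could be sketched as a small counter that tells the caller
when to throw the reused tree away and build a fresh one. It is only a
minimal sketch under assumptions: RebuildPolicy and should_rebuild are
hypothetical names, and the actual rebuilding and printing are left to the
caller.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decides when the reused DOM tree should be
// discarded and rebuilt from scratch, to cap memory growth.
class RebuildPolicy {
public:
    explicit RebuildPolicy(std::size_t interval)
        : interval_(interval), prints_since_rebuild_(0) {
        assert(interval > 0);
    }

    // Call once per print. Returns true on every interval-th call,
    // meaning the caller should rebuild the tree before printing.
    bool should_rebuild() {
        if (++prints_since_rebuild_ >= interval_) {
            prints_since_rebuild_ = 0;
            return true;
        }
        return false;
    }

private:
    std::size_t interval_;
    std::size_t prints_since_rebuild_;
};
```

With RebuildPolicy policy(1000), the print routine would check
policy.should_rebuild() first; on true it would release the old document
and recreate the tree, otherwise it would keep patching the existing one.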

good luck,

laurent

re-posting: re-use of DOMDocument causing memory bloat?

Posted by Dor-Shifer Amit <Am...@comverse.com>.

Re-posting this. I still have no resolution to these issues.
10x,
Amit

RE: re-use of DOMDocument causing memory bloat?

Posted by Dor-Shifer Amit <Am...@comverse.com>.

Seems this is a known bug:
http://issues.apache.org/jira/browse/XERCESC-1465
Sorry about the noise.
Amit
