Posted to j-users@xerces.apache.org by Neil Bacon <ne...@cambia.org> on 2006/08/25 08:55:10 UTC

memory leak with DTD entity references?

Hi,
I've been running out of memory reusing the same XMLReader 
(xercesImpl-2.8.0) to parse many large documents.
The documents reference the same DTD which references many entities.
Profiling (with netbeans-5.0) reveals that the problem is with char[]s 
allocated by:

 org.apache.xerces.util.SymbolTable$Entry.<init>
   org.apache.xerces.util.SymbolTable.addSymbol()
     org.apache.xerces.impl.XMLEntityScanner.scanName()
       org.apache.xerces.impl.XMLDTDScannerImpl.scanEntityDecl()
...
Maybe it's storing the symbol table for the same DTD for each new 
document and never discarding it?
Should it recognize a previously parsed DTD and reuse the existing 
symbol table?

I've worked around it by using a new XMLReader for each document.
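For what it's worth, the workaround looks roughly like this (a minimal sketch using the stock JAXP factory; the inline documents and class name are placeholders, not my real code):

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class FreshReaderPerDoc {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Placeholder documents; the real code streams large files from disk.
        List<String> docs = List.of("<a><b/></a>", "<c/>");
        for (String doc : docs) {
            // A fresh XMLReader per document, so each parse starts with an
            // empty SymbolTable that becomes garbage once the reader does.
            XMLReader r = factory.newSAXParser().getXMLReader();
            r.setContentHandler(new DefaultHandler());
            r.parse(new InputSource(new StringReader(doc)));
        }
        System.out.println("parsed " + docs.size() + " documents");
    }
}
```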

Can I get DTDs and entities cached to improve performance?
I'm using org.apache.xerces.util.XMLCatalogResolver.

Cheers,
    Neil.




---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: memory leak with DTD entity references?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Neil Bacon <ne...@cambia.org> wrote on 08/29/2006 02:45:15 AM:

<snip/>

> > From a quick perusal these DTDs (including the external entities they 
> > reference) look very large. It's not just the entity declarations. Just 
> > about everything in these DTDs which matches the Name production from the 
> > XML spec gets added to the SymbolTable. I assume each document you parse 
> > only references one of them. Perhaps it's the sum of the unique names from 
> > each of the DTDs which leads to your app running out of memory.
> Yes, they are quite large; however, I still think there is a problem because:
> 
> 1) even when using "java -Xmx7000M" (that's 7 salesman's gigabytes) it 
> falls over (whereas 300 MB is enough if I use a new parser for each doc);
> 
> 2) profiling shows that symbol table entries exist with a continuously 
> growing number of different garbage collection generations (new entries 
> are continuously being added without the old ones being cleaned up). If 
> the cache was working, new entries would not be created once each DTD had 
> been read once.

I still think more unique names than you're counting are being pumped into 
the SymbolTable. The workaround I mentioned in the other thread avoids the 
issue with unbounded symbol table growth. You could even write an 
extension to the SymbolTable which is memory-sensitive (i.e. uses 
SoftReferences) and register it in place of the default SymbolTable.
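The interning idea can be sketched in plain Java without touching Xerces internals. This is only an illustration of a memory-sensitive symbol cache using SoftReference, not the actual org.apache.xerces.util.SymbolTable API; a real extension would subclass SymbolTable and be registered on the parser via the "http://apache.org/xml/properties/internal/symbol-table" property:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: a symbol cache whose entries are held through
 *  SoftReferences, so the GC may reclaim them under memory pressure
 *  instead of letting the cache grow without bound. Not the real
 *  org.apache.xerces.util.SymbolTable API. */
public class SoftSymbolCache {
    private final Map<String, SoftReference<String>> cache = new HashMap<>();

    public synchronized String addSymbol(String symbol) {
        SoftReference<String> ref = cache.get(symbol);
        String interned = (ref == null) ? null : ref.get();
        if (interned != null) {
            return interned;            // cache hit: reuse the canonical copy
        }
        cache.put(symbol, new SoftReference<>(symbol));
        return symbol;                  // miss (or cleared ref): store anew
    }

    public static void main(String[] args) {
        SoftSymbolCache t = new SoftSymbolCache();
        String a = t.addSymbol(new String("name"));
        String b = t.addSymbol(new String("name"));
        System.out.println(a == b);     // same canonical instance while cached
    }
}
```

The tradeoff is that reclaimed symbols get re-created on the next lookup, so you pay some speed for bounded memory.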

> Is it possible that I'm messing things up by having xercesImpl-2.8.0 in 
> the classpath without pointing to it with -Djava.endorsed.dirs?

If org.apache.xerces.* classes aren't included in the JRE you're using, you 
don't need to use java.endorsed.dirs (but you should probably be using it 
for the xml-apis.jar).

> Cheers,
>    Neil.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org



Re: memory leak with DTD entity references?

Posted by Neil Bacon <ne...@cambia.org>.
Michael Glavassevich wrote:
>> Perhaps this behaviour could be affected by my use of 
>> org.apache.xerces.util.XMLCatalogResolver?
>
> How are you using it?
        XMLReader r = factory.newSAXParser().getXMLReader();
        r.setEntityResolver(entityResolver);

with catalog.xml containing:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- US applications use "us-sequence-listing.dtd";
       grants use "us-sequence-listing-2004-03-09.dtd".
       I've only found the latter at the USPTO, so we make the former
       refer to the latter.
  -->
  <system systemId="us-sequence-listing.dtd" uri="dtd/us-sequence-listing-2004-03-09.dtd"/>

  <!-- works with apache xerces XMLCatalogResolver -->
  <rewriteSystem systemIdStartString="c:\pap\dtds\entities\" rewritePrefix="dtd/entities/"/>
  <rewriteSystem systemIdStartString="c:\pap\dtds\" rewritePrefix="dtd/"/>
  <rewriteSystem systemIdStartString=".\entities\" rewritePrefix="dtd/entities/"/>
  <rewriteSystem systemIdStartString=".\" rewritePrefix="dtd/"/>
  <rewriteSystem systemIdStartString="" rewritePrefix="dtd/"/>

</catalog>
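For comparison, the effect of the rewriteSystem entries above can be sketched with a plain SAX EntityResolver (illustrative only; the class name and the subset of rules here are my own, and XMLCatalogResolver derives the same behaviour from catalog.xml):

```java
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Illustrative stand-in for the catalog above: rewrite known
 *  system-id prefixes onto a local dtd/ directory. */
public class PrefixRewritingResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        InputSource src = new InputSource(rewrite(systemId));
        src.setPublicId(publicId);
        return src;
    }

    static String rewrite(String systemId) {
        if (systemId.startsWith("c:\\pap\\dtds\\entities\\"))
            return "dtd/entities/" + systemId.substring("c:\\pap\\dtds\\entities\\".length());
        if (systemId.startsWith("c:\\pap\\dtds\\"))
            return "dtd/" + systemId.substring("c:\\pap\\dtds\\".length());
        // Mirrors the catch-all <rewriteSystem systemIdStartString="" ...>
        return "dtd/" + systemId;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("c:\\pap\\dtds\\entities\\foo.ent"));
        System.out.println(rewrite("bar.dtd"));
    }
}
```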
>> I'm processing US patent application data 
>> from the USPTO using their DTDs:
>>
>>     * us-patent-application-v41-2005-08-25.dtd
>>     * us-patent-application-v40-2004-12-02.dtd
>>     * us-sequence-listing-2004-03-09.dtd
>>     * pap-v16-2002-01-01.dtd
>>     * pap-v15-2001-01-31.dtd
>
> From a quick perusal these DTDs (including the external entities they 
> reference) look very large. It's not just the entity declarations. Just 
> about everything in these DTDs which matches the Name production from the 
> XML spec gets added to the SymbolTable. I assume each document you parse 
> only references one of them. Perhaps it's the sum of the unique names from 
> each of the DTDs which leads to your app running out of memory.
Yes, they are quite large; however, I still think there is a problem because:

1) even when using "java -Xmx7000M" (that's 7 salesman's gigabytes) it 
falls over (whereas 300 MB is enough if I use a new parser for each doc);

2) profiling shows that symbol table entries exist with a continuously 
growing number of different garbage collection generations (new entries 
are continuously being added without the old ones being cleaned up). If 
the cache was working, new entries would not be created once each DTD had 
been read once.

Is it possible that I'm messing things up by having xercesImpl-2.8.0 in 
the classpath without pointing to it with -Djava.endorsed.dirs?

Cheers,
   Neil.



Re: memory leak with DTD entity references?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Neil,

Neil Bacon <ne...@cambia.org> wrote on 08/28/2006 07:37:56 PM:

> Hi Michael,
> Thanks for your reply.
> 
> Michael Glavassevich wrote:
> > Hi Neil,
> >
> > There was a related discussion [1][2] about the SymbolTable on this list 
> > back in March 2005.
> Thanks - yes I did come across that thread before posting. Although 
> closely related, I don't think it's the same issue because that is about 
> running out of memory parsing a single document and my issue is 
> specifically with reusing the same parser to parse many documents (using 
> a limited set of DTDs). I don't have a problem if I get a new parser for 
> each document.

Whether it's one large document with a million different names or a 
thousand documents with those million names distributed across them, it 
has the same effect. The parser's SymbolTable will have all of the names in 
its cache. If this is what's happening you can write an extension to the 
SymbolTable which uses less memory (possibly one which doesn't cache at 
all) and set it on the parser.
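The "doesn't cache at all" variant can be sketched as simply as this (again an illustration of the idea, not an actual SymbolTable subclass; note that a real integration would have to account for parser code that compares symbols by identity):

```java
/** Illustration only: a SymbolTable-like interner that caches nothing.
 *  Every lookup materializes a fresh String, so memory stays bounded,
 *  but symbol identity (==) is lost and callers must compare with
 *  equals(), trading speed for memory. */
public class NoCacheSymbols {
    public String addSymbol(char[] buffer, int offset, int length) {
        return new String(buffer, offset, length);
    }

    public static void main(String[] args) {
        NoCacheSymbols t = new NoCacheSymbols();
        char[] buf = "elem1".toCharArray();
        String a = t.addSymbol(buf, 0, buf.length);
        String b = t.addSymbol(buf, 0, buf.length);
        System.out.println(a.equals(b)); // equal content...
        System.out.println(a == b);      // ...but distinct objects
    }
}
```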

> Could the parser be keeping the symbol table from previous documents but 
> not reusing it when it comes across the same DTD in a new document?

A parser instance only has one SymbolTable. The one it has will only be 
replaced if you explicitly replace it by setting a different SymbolTable 
on the parser.

> Perhaps this behaviour could be affected by my use of 
> org.apache.xerces.util.XMLCatalogResolver?

How are you using it?

> > Do these large documents contain similar names or do 
> > they contain many unique names? Specifically, do your documents look like 
> > this? 
> >
> > Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc> 
> > ... 
> > Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/> 
> > <elem100000-n/></doc>
> > 
> No, the data is not like that. There are a decent number of element names 
> as well as some heavily reused elements. The DTDs contain more than 
> 2000 entity declarations. I'm processing US patent application data 
> from the USPTO using their DTDs:
> 
>     * us-patent-application-v41-2005-08-25.dtd
>     * us-patent-application-v40-2004-12-02.dtd
>     * us-sequence-listing-2004-03-09.dtd
>     * pap-v16-2002-01-01.dtd
>     * pap-v15-2001-01-31.dtd

From a quick perusal these DTDs (including the external entities they 
reference) look very large. It's not just the entity declarations. Just 
about everything in these DTDs which matches the Name production from the 
XML spec gets added to the SymbolTable. I assume each document you parse 
only references one of them. Perhaps it's the sum of the unique names from 
each of the DTDs which leads to your app running out of memory.

> Cheers,
>     Neil Bacon
>     Cambia
> 

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org


Re: memory leak with DTD entity references?

Posted by Neil Bacon <ne...@cambia.org>.
Hi Michael,
Thanks for your reply.

Michael Glavassevich wrote:
> Hi Neil,
>
> There was a related discussion [1][2] about the SymbolTable on this list 
> back in March 2005.
Thanks - yes I did come across that thread before posting. Although 
closely related, I don't think it's the same issue because that is about 
running out of memory parsing a single document and my issue is 
specifically with reusing the same parser to parse many documents (using 
a limited set of DTDs). I don't have a problem if I get a new parser for 
each document.

Could the parser be keeping the symbol table from previous documents but 
not reusing it when it comes across the same DTD in a new document? 
Perhaps this behaviour could be affected by my use of 
org.apache.xerces.util.XMLCatalogResolver?
> Do these large documents contain similar names or do 
> they contain many unique names? Specifically, do your documents look like 
> this? 
>
> Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc> 
> ... 
> Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/> 
> <elem100000-n/></doc>
>   
No, the data is not like that. There are a decent number of element names 
as well as some heavily reused elements. The DTDs contain more than 
2000 entity declarations. I'm processing US patent application data 
from the USPTO using their DTDs:

    * us-patent-application-v41-2005-08-25.dtd
    * us-patent-application-v40-2004-12-02.dtd
    * us-sequence-listing-2004-03-09.dtd
    * pap-v16-2002-01-01.dtd
    * pap-v15-2001-01-31.dtd

Cheers,
    Neil Bacon
    Cambia



Re: memory leak with DTD entity references?

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Neil,

There was a related discussion [1][2] about the SymbolTable on this list 
back in March 2005. Do these large documents contain similar names or do 
they contain many unique names? Specifically, do your documents look like 
this? 

Doc 1: <doc><elem1/> <elem2/> . . . <elem99999/> <elem100000/></doc> 
... 
Doc n: <doc><elem1-n/> <elem2-n/> . . . <elem99999-n/> 
<elem100000-n/></doc>

If they do, that would explain why you're running out of memory. The 
SymbolTable will create an entry for each unique name. The last time this 
came up I proposed a workaround [3] (which reduces memory usage at the 
expense of speed).

Thanks.

[1] http://marc.theaimsgroup.com/?t=111099151200003&r=1&w=2
[2] http://marc.theaimsgroup.com/?t=111099151200003&r=2&w=2
[3] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=111103024915201&w=2

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Neil Bacon <ne...@cambia.org> wrote on 08/25/2006 02:55:10 AM:

> Hi,
> I've been running out of memory reusing the same XMLReader 
> (xercesImpl-2.8.0) to parse many large documents.
> The documents reference the same DTD which references many entities.
> Profiling (with netbeans-5.0) reveals that the problem is with char[]s 
> allocated by:
> 
>  org.apache.xerces.util.SymbolTable$Entry.<init>
>    org.apache.xerces.util.SymbolTable.addSymbol()
>      org.apache.xerces.impl.XMLEntityScanner.scanName()
>        org.apache.xerces.impl.XMLDTDScannerImpl.scanEntityDecl()
> ...
> Maybe it's storing the symbol table for the same DTD for each new 
> document and never discarding it?
> Should it recognize a previously parsed DTD and reuse the existing 
> symbol table?
> 
> I've worked around it by using a new XMLReader for each document.
> 
> Can I get DTDs and entities cached to improve performance?
> I'm using org.apache.xerces.util.XMLCatalogResolver.
> 
> Cheers,
>     Neil.
> 
> 
> 
> 

