Posted to j-dev@xerces.apache.org by bu...@apache.org on 2003/11/10 05:01:12 UTC

DO NOT REPLY [Bug 897] - Memory leak reading large XML-files with SAX parser

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=897>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=897

------- Additional Comments From anewman@pisoftware.com  2003-11-10 04:01 -------
This is still a problem in the latest version of Xerces (2.5).  The number of
java.io.StringReader instances increases until the JVM runs out of memory; they
are never garbage collected.

Here's some sample RDF/XML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rdf:RDF [
<!ENTITY math  "http://kowari.org/math#">
<!ENTITY owl   "http://www.w3.org/2002/07/owl#">
<!ENTITY rdf   "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ENTITY rdfs  "http://www.w3.org/2000/01/rdf-schema#">
<!ENTITY xsd   "http://www.w3.org/2001/XMLSchema#">
]>

<rdf:RDF xmlns:math ="&math;"
	 xmlns:owl  ="&owl;"
	 xmlns:rdf  ="&rdf;"
	 xmlns:rdfs ="&rdfs;">
<rdf:Description>
  <owl:sameIndividualAs rdf:datatype="&xsd;integer">14</owl:sameIndividualAs>
  <rdfs:label xml:lang="en">fourteen</rdfs:label>
  <math:roman>XIV</math:roman>
  <math:square rdf:datatype="&xsd;integer">196</math:square>
  <math:primeFactorization>
    <rdf:Bag>
      <rdf:li rdf:datatype="&xsd;integer">2</rdf:li>
      <rdf:li rdf:datatype="&xsd;integer">7</rdf:li>
    </rdf:Bag>
  </math:primeFactorization>
</rdf:Description>
<rdf:Description>
  <owl:sameIndividualAs rdf:datatype="&xsd;integer">15</owl:sameIndividualAs>
  <rdfs:label xml:lang="en">fifteen</rdfs:label>
  <math:roman>XV</math:roman>
  <math:square rdf:datatype="&xsd;integer">225</math:square>
  <math:primeFactorization>
    <rdf:Bag>
      <rdf:li rdf:datatype="&xsd;integer">3</rdf:li>
      <rdf:li rdf:datatype="&xsd;integer">5</rdf:li>
    </rdf:Bag>
  </math:primeFactorization>
  <rdf:type rdf:resource="&math;TriangularNumber"/>
</rdf:Description>
</rdf:RDF>
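
To make the report easier to reproduce, here is a minimal sketch that generates
a document of the same shape (entity references inside attribute values) and
runs it through a SAX parser obtained via JAXP. The class and method names
(EntityAttrRepro, buildDocument, countDatatypeAttributes) are made up for
illustration and are not from the original test case. It also assumes the Xerces
jar under test is on the classpath so that JAXP selects it; otherwise the JDK's
built-in parser is used, which may behave differently and may enforce its own
entity-expansion limit for very large counts.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction harness; names are illustrative only.
class EntityAttrRepro {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

    /** Builds RDF/XML with n descriptions, each using &xsd; in an attribute value. */
    public static String buildDocument(int n) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n")
          .append("<!DOCTYPE rdf:RDF [\n")
          .append("<!ENTITY math \"http://kowari.org/math#\">\n")
          .append("<!ENTITY rdf \"").append(RDF_NS).append("\">\n")
          .append("<!ENTITY xsd \"http://www.w3.org/2001/XMLSchema#\">\n")
          .append("]>\n")
          .append("<rdf:RDF xmlns:math=\"&math;\" xmlns:rdf=\"&rdf;\">\n");
        for (int i = 0; i < n; i++) {
            sb.append("<rdf:Description>\n")
              .append("  <math:square rdf:datatype=\"&xsd;integer\">")
              .append((long) i * i)
              .append("</math:square>\n")
              .append("</rdf:Description>\n");
        }
        return sb.append("</rdf:RDF>\n").toString();
    }

    /** Parses the document with SAX and counts rdf:datatype attributes that expanded to xsd:integer. */
    public static int countDatatypeAttributes(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (XSD_INTEGER.equals(atts.getValue(RDF_NS, "datatype"))) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            // A byte stream is used so the harness itself creates no StringReader.
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        System.out.println(countDatatypeAttributes(buildDocument(n)) + " expanded attribute values");
    }
}
```

Running this under a profiler with increasing values of n should show whether
the StringReader count grows with the number of entity references in attribute
values.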

When all of the entity references are inlined, only 4 StringReader objects are
ever allocated.  For example:
<rdf:Description>
  <owl:sameIndividualAs rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">14</owl:sameIndividualAs>
  <rdfs:label xml:lang="en">fourteen</rdfs:label>
  <math:roman>XIV</math:roman>
  <math:square rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">196</math:square>
  <math:primeFactorization>
    <rdf:Bag>
      <rdf:li rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</rdf:li>
      <rdf:li rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">7</rdf:li>
    </rdf:Bag>
  </math:primeFactorization>
</rdf:Description>
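
As an illustration of what "inlining" means here, the following is a naive
sketch that textually replaces references such as &xsd; with their values
before the document is handed to the parser. It is not something the original
test did programmatically as far as this report says; the class EntityInliner
and its inline method are hypothetical. It assumes the replacement values
contain no markup characters (true for the URIs above) and it leaves unknown
names, such as the predefined &amp;, untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a textual shortcut, not a general XML entity processor.
class EntityInliner {
    // Matches a general entity reference such as &xsd; (simplified name syntax).
    private static final Pattern REF = Pattern.compile("&([A-Za-z_][A-Za-z0-9_.-]*);");

    /** Replaces each reference whose name is a key in entities; other references are kept. */
    public static String inline(String text, Map<String, String> entities) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = entities.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Called with a map from "xsd" to "http://www.w3.org/2001/XMLSchema#", this turns
rdf:datatype="&xsd;integer" into the inlined form shown above.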


Here's a report from Optimize It after parsing a large amount of this XML:
2509 instances of java.io.StringReader allocated.
   100.0% org.apache.xerces.impl.XMLEntityManager.startEntity()
      100.0% org.apache.xerces.impl.XMLScanner.scanAttributeValue()
         100.0% org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute()
            100.0% org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement()
               99.84% org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch()
                  99.84% org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument()
                     99.84% org.apache.xerces.parsers.DTDConfiguration.parse()
