
Posted to j-users@xerces.apache.org by Daniel Rabe <dr...@Eloquent.com> on 2003/02/11 19:25:08 UTC

Slow SAX parsing of large CDATA?

I'm using SAX (Xerces 2.3.0 on Windows XP) to parse an XML file that can
contain large CDATA sections (where large is somewhere between 1 and 5 MB).
The data is Base64-encoded. The code works properly, but when the CDATA is
over 1 MB or so, it's very slow. A 1 MB CDATA section seems to be processed
in several seconds, but once it gets up to about 3 or 4 MB, processing time
goes up to about 10 minutes. It seems like Xerces is building up a huge
buffer of all the data before calling my characters callback. (I'd prefer to
get many characters callbacks so I can stream the data to a file, rather
than accumulating all the data in memory.) This is the stack trace I get
while it's processing. Garbage collection is also very active during this
process. It doesn't seem to matter whether my max heap is set to 128 MB or
256 MB; the behavior is the same.

at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

Since my data is base64-encoded, I don't really need the CDATA... I can just
treat it like element data. If I do this, I get a characters callback for
each line of the encoded data, and it's wonderfully fast. Unfortunately, the
XML files that I need to process are provided by another vendor and contain
the CDATA.
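
For illustration, the kind of streaming handler described above might look like the following minimal sketch. It is not the poster's actual code: the class name CdataStreamingHandler and the helper extractCdata are hypothetical, and a StringWriter stands in for the output file. SAX allows a parser to deliver character data in one callback or in many, so the handler simply writes each chunk as it arrives and uses the standard LexicalHandler callbacks to limit itself to CDATA content.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: copies CDATA text to a Writer as each chunk arrives. */
class CdataStreamingHandler extends DefaultHandler implements LexicalHandler {
    private final Writer out;
    private boolean inCdata = false;

    CdataStreamingHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inCdata) {
            return; // ignore text outside CDATA sections
        }
        try {
            // Write the chunk straight through instead of accumulating it.
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    @Override public void startCDATA() { inCdata = true; }
    @Override public void endCDATA() { inCdata = false; }

    // The remaining LexicalHandler callbacks are not needed for this sketch.
    @Override public void startDTD(String name, String publicId, String systemId) {}
    @Override public void endDTD() {}
    @Override public void startEntity(String name) {}
    @Override public void endEntity(String name) {}
    @Override public void comment(char[] ch, int start, int length) {}

    /** Parses the given XML and returns the CDATA text that was streamed out. */
    static String extractCdata(String xml) throws Exception {
        StringWriter sink = new StringWriter(); // stand-in for a file writer
        CdataStreamingHandler handler = new CdataStreamingHandler(sink);
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Standard SAX2 property for registering a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractCdata("<doc><data><![CDATA[SGVsbG8=]]></data></doc>"));
    }
}
```

In real use the StringWriter would be replaced by a buffered file writer, and the Base64 text would still need to be decoded afterwards. How many characters callbacks arrive per CDATA section remains up to the parser.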

Has anybody else run into this? Are there any workarounds, or any way to
tell Xerces that I want more frequent characters callbacks?

Thanks,
Daniel Rabe
drabe@eloquent.com

Re: Slow SAX parsing of large CDATA?

Posted by Andy Clark <an...@apache.org>.

Daniel Rabe wrote:
> I'm using SAX (Xerces 2.3.0 on Windows XP) to parse an XML file that can
> contain large CDATA sections (where large is somewhere between 1 and 5
> MB). The data is Base64-encoded. The code works properly, but when the
> CDATA is over 1 MB or so, it's very slow. It seems like a 1 MB CDATA
> section can be processed in several seconds, but once it gets up to
> about 3 or 4 MB, processing time goes up to about 10 minutes.

The code responsible for scanning CDATA sections was
buffering the contents. So as the CDATA size increased,
the parser needed to create larger and larger buffers
to hold the data, degrading performance. Since the
2.3.0 release, I have received another report of this
problem and have fixed the code in CVS.

The next release of the parser will have the fix for
your problem. Or you can grab the latest jar files from
the nightly build:

   http://gump.covalent.net/jars/latest/xml-xerces2/
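
After swapping in newer jar files, it can be worth confirming which parser implementation the application actually loads, since an older copy earlier on the classpath would hide the fix. The following is a rough sketch using only standard JAXP calls. The class name ParserInfo and its methods are hypothetical and not part of Xerces.

```java
import java.net.URL;
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

/** Hypothetical helper: reports which SAX parser implementation JAXP selects. */
class ParserInfo {

    private static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Returns the implementation class name of the XMLReader in use. */
    static String readerClassName() throws Exception {
        return newReader().getClass().getName();
    }

    /** Returns where that class was loaded from, or a note if that is not known. */
    static String readerLocation() throws Exception {
        CodeSource src = newReader().getClass().getProtectionDomain().getCodeSource();
        URL location = (src == null) ? null : src.getLocation();
        // A parser bundled with the Java runtime may not report a location.
        return (location == null) ? "(no code source reported)" : location.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("XMLReader class: " + readerClassName());
        System.out.println("Loaded from:     " + readerLocation());
    }
}
```

If the printed location is not the jar that was just downloaded, the classpath order is the first thing to check.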

-- 
Andy Clark * andyc@apache.org
