Posted to c-dev@xerces.apache.org by Dan Rosen <Da...@efi.com> on 2004/05/07 21:01:50 UTC

Chunking of characters() callbacks

Hi all,

This might be a bit of an unusual question... I think normally one of the
first things people ask when starting to work with an XML parser is, "how
can I make it stop chunking my characters() callbacks?" and the answer
usually is, "well, it's allowed to do that, just aggregate them yourself." In
my case, I'd actually like to *force* Xerces to chunk. I have some truly
horrible, degenerate XML I need to parse, which basically consists of a 70
megabyte block of binary data (Base64 encoded so as not to wreak havoc on XML
parsing) enclosed in an element.
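
For reference, the usual "aggregate them yourself" answer might look roughly
like the sketch below. TextCollector and its members are hypothetical names;
the methods only mirror the shape of the SAX callbacks, use plain char rather
than XMLCh, and do not derive from any Xerces class. It also assumes the
element of interest has no child elements.

```cpp
#include <cstddef>
#include <string>

// Hypothetical handler shaped like a SAX content handler (not a Xerces type).
// It joins however many characters() chunks arrive for one element.
class TextCollector {
public:
    // A new element begins: forget any text left over from before.
    void startElement(const std::string& /*name*/) { pending_.clear(); }

    // May be called any number of times per text run; just append.
    void characters(const char* chars, std::size_t length) {
        pending_.append(chars, length);
    }

    // The element is complete: the aggregated text is now available.
    void endElement(const std::string& name) {
        lastName_ = name;
        lastText_ = pending_;
        pending_.clear();
    }

    const std::string& lastName() const { return lastName_; }
    const std::string& lastText() const { return lastText_; }

private:
    std::string pending_;   // text seen so far for the current element
    std::string lastName_;  // name of the most recently closed element
    std::string lastText_;  // full text of the most recently closed element
};
```

This is fine for small values, but it recreates exactly the memory problem
described here when the text is tens of megabytes.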

The trouble I'm running into is, when parsing, a buffer in memory for the
characters in this tremendous block of data is being maintained, and is grown
when necessary by XMLBuffer::insureCapacity. This buffer gets so large that
at some point, the allocation in insureCapacity fails, and parsing can't
continue. What I'd like to be able to do is, specify to Xerces that it should
buffer up only a certain maximum amount of character data at a time before
calling sendCharData (in IGXMLScanner::scanCharData), rather than waiting
until it has everything.
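
A minimal sketch of the behaviour being asked for, written as standalone C++
rather than as a patch to Xerces: CappedCharBuffer, its sink callback and the
cap parameter are all hypothetical names, and nothing here reflects how
XMLBuffer or IGXMLScanner are actually implemented. The idea is only that the
buffer hands its contents on as soon as it reaches a fixed cap, so it never
has to grow.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a scanner's character buffer; not Xerces code.
class CappedCharBuffer {
public:
    // The sink plays the role of a characters() callback.
    using Sink = std::function<void(const std::string&)>;

    CappedCharBuffer(std::size_t maxChars, Sink sink)
        : maxChars_(maxChars), sink_(std::move(sink)) {
        assert(maxChars_ > 0);
        buf_.reserve(maxChars_);  // one allocation up front
    }

    // Append one character; deliver a chunk once the cap is reached.
    void append(char c) {
        buf_.push_back(c);
        if (buf_.size() >= maxChars_) {
            flush();
        }
    }

    // Deliver whatever is pending, e.g. at the end of the text run.
    void flush() {
        if (!buf_.empty()) {
            sink_(buf_);
            buf_.clear();  // size goes back to 0, so the cap is never exceeded
        }
    }

private:
    std::size_t maxChars_;
    Sink sink_;
    std::string buf_;
};
```

With a cap of 4 and the input "abcdefghij", the sink would be called three
times, with "abcd", "efgh" and (after the final flush) "ij".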

As far as I can tell, there isn't a way to do this currently. But I'd like
some feedback as to how easily people think this might be implemented,
whether it's reasonable to do so, etc., and (as a newbie to the Xerces
codebase) hopefully get some assistance in implementing it.

Any help would be much appreciated. I look forward to your answers,
Dan Rosen

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


RE: Chunking of characters() callbacks

Posted by Dean Roddey <dr...@charmedquark.com>.
Personally, I don't think there's much to be gained by the buffer *ever*
being over a few K at a time. So an easy and flexible fix would just
be to max it out at maybe 8K and be done with it. Everyone *has* to be
prepared for the possibility of multiple chunks, even if they only have two
characters' worth of data, so no one can complain that this breaks their
code, because if it does, then they weren't compliant anyway. And there's
probably not much performance gain or loss one way or another. And, if you
are looking at this data in a streaming way, you'll find errors sooner and
not do further parsing of redundant data.
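
As a rough illustration of consuming the data in a streaming way, here is a
sketch of a Base64 decoder that accepts text in arbitrary chunks and fails on
the first bad character. StreamingBase64Decoder and feed() are hypothetical
names, not Xerces APIs; the sketch works on plain char, assumes an
ASCII-compatible character set, and does not validate padding length.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Xerces: decodes Base64 text delivered in
// pieces of any size, keeping only a few leftover bits between calls.
class StreamingBase64Decoder {
public:
    // Decode one chunk and append the decoded bytes to out.
    // Throws std::runtime_error at the first invalid character.
    void feed(const char* text, std::size_t len, std::string& out) {
        for (std::size_t i = 0; i < len; ++i) {
            const char c = text[i];
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') continue;
            if (c == '=') { done_ = true; continue; }  // padding ends the data
            const int v = value(c);
            if (v < 0 || done_) {
                throw std::runtime_error("invalid Base64 input");
            }
            acc_ = (acc_ << 6) | static_cast<std::uint32_t>(v);
            bits_ += 6;
            if (bits_ >= 8) {  // a full byte is available
                bits_ -= 8;
                out.push_back(static_cast<char>((acc_ >> bits_) & 0xFF));
            }
        }
    }

private:
    // Map one Base64 character to its 6-bit value, or -1 if it is not one.
    static int value(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    }

    std::uint32_t acc_ = 0;  // bit accumulator carried across chunks
    int bits_ = 0;           // number of not yet emitted bits in acc_
    bool done_ = false;      // set once '=' padding has been seen
};
```

A characters() handler could call feed() on each chunk and write the decoded
bytes straight to a file, so neither the encoded nor the decoded data ever has
to be held in memory in full.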

-------------------------------------
Dean Roddey
The Charmed Quark Controller
droddey@charmedquark.com
www.charmedquark.com
 


-----Original Message-----
From: Sean Kelly [mailto:sean@f4.ca] 
Sent: Friday, May 07, 2004 8:42 PM
To: xerces-c-dev@xml.apache.org
Subject: Re: Chunking of characters() callbacks


Dan Rosen wrote:
> 
> The trouble I'm running into is, when parsing, a buffer in memory for 
> the characters in this tremendous block of data is being maintained, 
> and is grown when necessary by XMLBuffer::insureCapacity. This buffer 
> gets so large that at some point, the allocation in insureCapacity 
> fails, and parsing can't continue. What I'd like to be able to do is, 
> specify to Xerces that it should buffer up only a certain maximum 
> amount of character data at a time before calling sendCharData (in 
> IGXMLScanner::scanCharData), rather than waiting until it has 
> everything.
> 
> As far as I can tell, there isn't a way to do this currently. But I'd 
> like some feedback as to how easily people think this might be 
> implemented, whether it's reasonable to do so, etc., and (as a newbie 
> to the Xerces
> codebase) hopefully get some assistance in implementing it.

I'm quite interested in this as well.  I asked this question a few 
months ago and didn't get a response.  I tend to work with very large 
XML streams that have substantial chunks of data Base64 encoded as 
character data.  All I'd really like to be able to do is set the 
character buffer size to, say, 4k and not allow it to grow beyond that.
This would obviously be for the SAX parser.

Sean






Re: Chunking of characters() callbacks

Posted by Sean Kelly <se...@f4.ca>.
Dan Rosen wrote:
> 
> The trouble I'm running into is, when parsing, a buffer in memory for the
> characters in this tremendous block of data is being maintained, and is grown
> when necessary by XMLBuffer::insureCapacity. This buffer gets so large that
> at some point, the allocation in insureCapacity fails, and parsing can't
> continue. What I'd like to be able to do is, specify to Xerces that it should
> buffer up only a certain maximum amount of character data at a time before
> calling sendCharData (in IGXMLScanner::scanCharData), rather than waiting
> until it has everything.
> 
> As far as I can tell, there isn't a way to do this currently. But I'd like
> some feedback as to how easily people think this might be implemented,
> whether it's reasonable to do so, etc., and (as a newbie to the Xerces
> codebase) hopefully get some assistance in implementing it.

I'm quite interested in this as well.  I asked this question a few 
months ago and didn't get a response.  I tend to work with very large 
XML streams that have substantial chunks of data Base64 encoded as 
character data.  All I'd really like to be able to do is set the 
character buffer size to, say, 4k and not allow it to grow beyond that.
This would obviously be for the SAX parser.

Sean
