Posted to c-dev@xerces.apache.org by ji...@apache.org on 2004/05/10 19:52:56 UTC

[jira] Created: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Message:

  A new issue has been created in JIRA.

---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESC-1207

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESC-1207
    Summary: XMLScanner::scanCharData fills XMLBuffer until out of memory
       Type: Bug

     Status: Unassigned
   Priority: Critical

    Project: Xerces-C++
 Components: 
             Non-Validating Parser
   Versions:
             2.5.0

   Assignee: 
   Reporter: Dan Rosen

    Created: Mon, 10 May 2004 10:51 AM
    Updated: Mon, 10 May 2004 10:51 AM

Description:
When parsing an XML file consisting primarily of very large (hundreds of megabytes) blocks of contiguous character data, XMLScanner::scanCharData() happily attempts to build a single XMLBuffer containing all the data. Eventually the buffer becomes so large that the reallocation within XMLBuffer::insureCapacity() fails, causing std::bad_alloc to be thrown, or a crash in memcpy (depending on compiler). The fundamental problem seems to be that there is no upper bound imposed on buffer length.

In the SAX model, it is acceptable to issue multiple ContentHandler::characters() callbacks for a single contiguous block of data. The only restriction on how this should be implemented is that all characters in any single event must come from the same external entity; no further behavior is specified. So it would be perfectly conformant to the SAX model to set an upper bound on the size of a single characters() event.

(As far as I understand, allowing an upper bound in XMLScanner::scanCharData() would not affect the DOM)

I'd propose that an upper bound for character buffer size be added as an optional parameter (with some reasonable value as a default), either in the constructor of the parser or in useScanner(), and that that parameter be used to inform XMLScanner::scanCharData() when to force a call to sendCharData() to dump the buffer to its client.
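
The SAX point above (one contiguous block of text may legitimately arrive as several characters() events) can be illustrated from the consumer's side. The following is only a minimal sketch and does not use the Xerces-C++ headers: TextCollector, emitInChunks, and the use of char16_t in place of XMLCh are assumptions made to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for a SAX content handler (hypothetical, not the Xerces-C++ class).
// It makes no assumption about how many characters() events one text block yields.
struct TextCollector {
    std::u16string text;     // character data accumulated for the current element
    std::size_t events = 0;  // number of characters() callbacks received

    void startElement() { text.clear(); events = 0; }

    // Append each chunk; a single block of text may arrive in many pieces.
    void characters(const char16_t* chars, std::size_t length) {
        text.append(chars, length);
        ++events;
    }
};

// Hypothetical producer: delivers `data` in events of at most `maxChunk`
// characters, which is what a scanner with a bounded buffer would do.
inline void emitInChunks(const std::u16string& data, std::size_t maxChunk,
                         TextCollector& handler) {
    assert(maxChunk > 0);
    for (std::size_t pos = 0; pos < data.size(); pos += maxChunk) {
        const std::size_t n = std::min(maxChunk, data.size() - pos);
        handler.characters(data.data() + pos, n);
    }
}
```

A handler written this way sees the same text whether the producer sends one large event or many small ones; only the event count changes.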


---------------------------------------------------------------------
JIRA INFORMATION:
This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa

If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


[jira] Commented: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following comment has been added to this issue:

     Author: Dan Rosen
    Created: Thu, 13 May 2004 10:48 AM
       Body:
Hi Neil,
Thanks for the advice ...

> On the other hand, if you can design a handler that knows how to
> make appropriate calls to the scanner's sendChars() method so that
> the buffer gets flushed when a maximum buffer size is reached, then
> perhaps a pluggable handler wouldn't be necessary since the default
> behaviour would always work when an application has chosen to set
> this limit.

The code I have, as written, does have the pluggable handler notion, since I didn't want to arbitrarily couple the XMLBuffer class implementation (or worse, its interface) to the scanner. I didn't go so overboard as to allow multiple registered handlers, or anything like that; this seemed simple enough without being a hack.

> I'd also observe that XMLBuffer has to check to use the
> infelicitously named "insureCapacity()" method to make sure it's
> large enough

Yes, this is where I implemented the full-handler invocation. It's more or less as you describe. Once I get everything cleaned up, I'll post the patch here; you'll probably find it to be entirely unsurprising.

To allow a user-configurable buffer size limit, I think I'll end up adding a setter method on Parser and AbstractDOMParser, something like setInputBufferSize(). I think that would be most consistent with the existing API. Sound OK?
---------------------------------------------------------------------
View this comment:
  http://issues.apache.org/jira/browse/XERCESC-1207?page=comments#action_35530


[jira] Commented: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following comment has been added to this issue:

     Author: Neil Graham
    Created: Thu, 13 May 2004 6:45 AM
       Body:
Hi Dan. A few thoughts: You might want to look at the way the http://apache.org/xml/properties/input-buffer-size property is implemented in Xerces-J, since that seems to aim at doing pretty much what you're after. Xerces-J has a fairly different design when it comes to scanning, so it might not be directly relevant, but it might give you some ideas.

The idea sounds good in general to me.  It sounds like you'll need to define a new property on the parsers for the buffer size, and maybe one for the handler.  On the other hand, if you can design a handler that knows how to make appropriate calls to the scanner's sendChars() method so that the buffer gets flushed when a maximum buffer size is reached, then perhaps a pluggable handler wouldn't be necessary since the default behaviour would always work when an application has chosen to set this limit.  

I'd also observe that XMLBuffer has to check to use the infelicitously named "insureCapacity()" method to make sure it's large enough; you could easily add a boolean to this check so that, instead of expanding its capacity if it's full, if a maximum size has been set then a call would be made to the handler.  This extra check would then only be done in the relatively unlikely situation where a buffer reaches its full size, so certainly wouldn't impact the performance of existing code that doesn't need the new property.
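
The capacity-check idea in the paragraph above can be sketched roughly as follows. This is not the XMLBuffer source: SketchBuffer, setMaxSize, ensureCapacity, the std::function handler, and the use of std::length_error are all assumptions chosen for a self-contained illustration. The point it shows is that the maximum-size test lives only on the "buffer is full" path, so appends that fit in the current capacity are unaffected.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for XMLBuffer; names and layout are assumptions.
class SketchBuffer {
public:
    using FullHandler = std::function<void(SketchBuffer&)>;

    explicit SketchBuffer(std::size_t initialCapacity = 16)
        : fCapacity(initialCapacity) {}

    // A maximum of 0 means "unbounded", i.e. the existing behaviour.
    // Assumed to be called while the buffer is still empty.
    void setMaxSize(std::size_t maxSize, FullHandler handler) {
        fMaxSize = maxSize;
        fHandler = std::move(handler);
        if (fMaxSize != 0 && fCapacity > fMaxSize)
            fCapacity = fMaxSize;
    }

    void append(char16_t ch) {
        if (fData.size() >= fCapacity)  // rarely true: only when the buffer is full
            ensureCapacity(1);
        fData.push_back(ch);
    }

    void reset() { fData.clear(); }
    std::size_t length() const { return fData.size(); }

private:
    // Reached only when the buffer is full, so the limit check adds no cost
    // to appends that fit in the current capacity.
    void ensureCapacity(std::size_t extra) {
        if (fMaxSize != 0 && fData.size() + extra > fMaxSize && fHandler) {
            fHandler(*this);  // ask the owner to drain the buffer
            if (fData.size() + extra <= fCapacity)
                return;       // drained: no growth needed
        }
        const std::size_t needed = fData.size() + extra;
        std::size_t newCapacity = std::max(needed, fCapacity * 2);
        if (fMaxSize != 0)
            newCapacity = std::min(newCapacity, fMaxSize);
        if (newCapacity < needed)
            throw std::length_error("buffer is full and was not drained");
        fCapacity = newCapacity;
        fData.reserve(fCapacity);
    }

    std::vector<char16_t> fData;
    std::size_t fCapacity;
    std::size_t fMaxSize = 0;
    FullHandler fHandler;
};
```

With an initial capacity of 4 and a maximum of 8, the buffer grows once to 8 and then calls the handler each time a ninth character would be appended, provided the handler empties it.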
---------------------------------------------------------------------
View this comment:
  http://issues.apache.org/jira/browse/XERCESC-1207?page=comments#action_35524


[jira] Updated: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following issue has been updated:

    Updater: Dan Rosen (mailto:danr@efi.com)
       Date: Fri, 14 May 2004 12:31 PM
    Comment:
Oops, mangled the previous patch. Hopefully this one should apply properly.
    Changes:
             Attachment changed to inputbuffersize
    ---------------------------------------------------------------------
For a full history of the issue, see:

  http://issues.apache.org/jira/browse/XERCESC-1207?page=history


[jira] Commented: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following comment has been added to this issue:

     Author: Dan Rosen
    Created: Wed, 12 May 2004 2:38 PM
       Body:
I have a fix that's just about ready. You can specify an optional maximum size for an XMLBuffer and specify a handler to be invoked if the limit is reached. The handler should make its best attempt to empty the buffer as appropriate to the task. In this case, the handler for the fCDataBuf (a.k.a. "toUse" in XMLScanner::scanCharData) invokes XMLScanner::sendCharData to send a characters() callback and flush the buffer.

This is the cleanest design I can think of: it requires no changes to any of the scanning code in sendCharData or movePlainContentChars to handle the special case of the buffer being full, and it should account for only minimal performance overhead in XMLBuffer's internals (in most cases, only one if-zero comparison).
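
The wiring described above might look roughly like this. It is a sketch under stated assumptions, not the attached patch: BufferFullHandler, BoundedBuffer, SketchScanner, and the events vector (which stands in for characters() callbacks) are hypothetical names, and char16_t stands in for XMLCh. The scanning loop itself stays unaware of the limit; the buffer calls back into the scanner when it is full.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class BoundedBuffer;

// Hypothetical analogue of a "buffer full" handler: a pure virtual interface.
class BufferFullHandler {
public:
    virtual ~BufferFullHandler() = default;
    virtual void bufferFull(BoundedBuffer& buffer) = 0;
};

class BoundedBuffer {
public:
    void setFullHandler(BufferFullHandler* handler, std::size_t maxSize) {
        fHandler = handler;
        fMaxSize = maxSize;
    }
    void append(char16_t ch) {
        // When no limit is set (fMaxSize == 0) the only added work is the
        // first comparison.
        if (fMaxSize != 0 && fText.size() >= fMaxSize && fHandler)
            fHandler->bufferFull(*this);
        fText.push_back(ch);
    }
    const std::u16string& text() const { return fText; }
    bool isEmpty() const { return fText.empty(); }
    void reset() { fText.clear(); }

private:
    std::u16string fText;
    BufferFullHandler* fHandler = nullptr;
    std::size_t fMaxSize = 0;
};

// Hypothetical scanner: owns the character-data buffer and is its own handler.
class SketchScanner : private BufferFullHandler {
public:
    explicit SketchScanner(std::size_t maxBufferSize) {
        fCDataBuf.setFullHandler(this, maxBufferSize);
    }
    // The buffer holds a pointer to this object, so copying is disabled.
    SketchScanner(const SketchScanner&) = delete;
    SketchScanner& operator=(const SketchScanner&) = delete;

    void scanCharData(const std::u16string& input) {
        for (char16_t ch : input)
            fCDataBuf.append(ch);  // scanning loop knows nothing about the limit
        sendCharData(fCDataBuf);   // final flush at the end of the text
    }

    std::vector<std::u16string> events;  // stands in for characters() callbacks

private:
    void bufferFull(BoundedBuffer& buffer) override { sendCharData(buffer); }

    void sendCharData(BoundedBuffer& buffer) {
        if (buffer.isEmpty()) return;
        events.push_back(buffer.text());
        buffer.reset();
    }

    BoundedBuffer fCDataBuf;
};
```

Scanning ten characters with a limit of four produces three events of sizes 4, 4, and 2 in this sketch.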

What I haven't gotten implemented yet is a way to pass a parameter at instantiation time specifying this limit size. I'd like some feedback on how best to do this.
---------------------------------------------------------------------
View this comment:
  http://issues.apache.org/jira/browse/XERCESC-1207?page=comments#action_35512


[jira] Updated: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following issue has been updated:

    Updater: Dan Rosen (mailto:danr@efi.com)
       Date: Fri, 14 May 2004 12:28 PM
    Comment:
Added proposed patch for limiting input buffer size. There were a couple of things I was undecided on and would like reviewed:

- I wasn't sure what would be the most appropriate error code to use in XMLBuffer.cpp, when the buffer could not be resized (I currently use XMLExcepts::Array_BadNewSize).

- I'm not sure what the precedent is for avoiding the ambiguous base class problem (XMemory, specifically) when doing mix-in inheritance. The way I avoided it was to make XMLBufferFullHandler not inherit from XMemory at all, which I assume is fine since it's a pure virtual interface.

- I thought it might be fine to not modify the DOM interfaces to allow custom maximum buffer size, since there is a reasonable default set in the scanner implementation, and since I'd anticipate that DOM users are typically less memory-constrained than SAX users. Also, if it becomes necessary to add this, it will be straightforward to do so later.
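
The ambiguous base class question in the second point above can be illustrated with a small, self-contained sketch. None of these types come from Xerces-C++: MemoryBase stands in for a common base such as XMemory, and ScannerBase, HandlerWithBase, PlainHandler, AmbiguousScanner, and CleanScanner are hypothetical names. It only shows the general C++ behaviour when an interface that is mixed in does or does not share the common base.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for a common memory-management base such as XMemory (hypothetical).
struct MemoryBase {};

// Stand-in for an existing class that already derives from the common base.
struct ScannerBase : MemoryBase {};

// Variant A: the handler interface also derives from the common base.
struct HandlerWithBase : MemoryBase {
    virtual ~HandlerWithBase() = default;
    virtual void bufferFull() = 0;
};

// Variant B: the handler interface is a plain pure virtual interface.
struct PlainHandler {
    virtual ~PlainHandler() = default;
    virtual void bufferFull() = 0;
};

// Mixing in variant A gives two MemoryBase subobjects, so converting a
// pointer to MemoryBase* is ambiguous.
struct AmbiguousScanner : ScannerBase, HandlerWithBase {
    void bufferFull() override {}
};

// Mixing in variant B leaves exactly one MemoryBase subobject.
struct CleanScanner : ScannerBase, PlainHandler {
    void bufferFull() override {}
};

static_assert(!std::is_convertible<AmbiguousScanner*, MemoryBase*>::value,
              "two MemoryBase subobjects: the conversion is ambiguous");
static_assert(std::is_convertible<CleanScanner*, MemoryBase*>::value,
              "one MemoryBase subobject: the conversion is unambiguous");
```

Virtual inheritance from the common base is the other standard way to end up with a single shared subobject; leaving the base off the interface, as in variant B, avoids changing the existing hierarchy.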

Cheers,
dr
    Changes:
             Attachment changed to inputbuffersize
    ---------------------------------------------------------------------
For a full history of the issue, see:

  http://issues.apache.org/jira/browse/XERCESC-1207?page=history


[jira] Closed: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by xe...@xml.apache.org.
Message:

   The following issue has been closed.

---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESC-1207

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESC-1207
    Summary: XMLScanner::scanCharData fills XMLBuffer until out of memory
       Type: Bug

     Status: Closed
   Priority: Critical
 Resolution: FIXED

    Project: Xerces-C++
 Components: 
             Non-Validating Parser
   Versions:
             2.5.0

   Assignee: 
   Reporter: Dan Rosen

    Created: Mon, 10 May 2004 10:51 AM
    Updated: Wed, 29 Sep 2004 12:39 PM


[jira] Commented: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by ji...@apache.org.
The following comment has been added to this issue:

     Author: Dan Rosen
    Created: Mon, 10 May 2004 12:06 PM
       Body:
I'm thinking maybe the cleanest way to do this would be to encapsulate the "max size" logic in XMLBuffer, such that it throws an exception if the max size is exceeded. That'd keep the parser code clean, especially movePlainContentChars.
---------------------------------------------------------------------
View this comment:
  http://issues.apache.org/jira/browse/XERCESC-1207?page=comments#action_35480


[jira] Resolved: (XERCESC-1207) XMLScanner::scanCharData fills XMLBuffer until out of memory

Posted by xe...@xml.apache.org.
Message:

   The following issue has been resolved as FIXED.

   Resolver: PeiYong Zhang
       Date: Wed, 29 Sep 2004 12:39 PM

Dan Rosen's patch applied and available in cvs.