Posted to c-users@xerces.apache.org by Adrian Mcmenamin <ac...@york.ac.uk> on 2013/11/10 10:34:35 UTC

SAX2 XML file size limit?

I am parsing a very large file and the parsing seems to fail on a (bogus)
fatal error once I get over about 2^16 lines (on line 4,295,025,275 to be
precise). Is there a hard limit on file sizes that can be parsed?

Re: SAX2 XML file size limit?

Posted by Alberto Massari <al...@tiscali.it>.
The SAX2 parser is implemented in xercesc/parsers/SAX2XMLReaderImpl.cpp;
it reuses internal classes such as xercesc/internal/SGXMLScanner.cpp,
which in turn uses XMLReader.cpp

Alberto

Il 11/11/13 15:30, Adrian Mcmenamin ha scritto:
> On 10 November 2013 19:10, Adrian Mcmenamin <ac...@york.ac.uk> wrote:
>
>>
>>
>> On 10 November 2013 15:56, Alberto Massari <al...@tiscali.it>wrote:
>>
>>> Which version are you using on which operating system? Also, I guess you
>>> compiled Xerces on a 64 bit system.
>>>
>>>
>> This is xerces-c-3.1.1 (downloaded and built in the last few days),
>> running on a Linux amd64 3.9.3 kernel
>>
>> Yes, it's all 64 bit
>>
>
>
> Where can I find the sources for the parser if I want to look at whether
> this is an overflow error of some sort (as that is what it appears to be at
> first blush)?
>


Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
On 10 November 2013 19:10, Adrian Mcmenamin <ac...@york.ac.uk> wrote:

>
>
>
> On 10 November 2013 15:56, Alberto Massari <al...@tiscali.it>wrote:
>
>> Which version are you using on which operating system? Also, I guess you
>> compiled Xerces on a 64 bit system.
>>
>>
> This is xerces-c-3.1.1 (downloaded and built in the last few days),
> running on a Linux amd64 3.9.3 kernel
>
> Yes, it's all 64 bit
>



Where can I find the sources for the parser if I want to look at whether
this is an overflow error of some sort (as that is what it appears to be at
first blush)?

Re: SAX2 XML file size limit?

Posted by Rob Cameron <ro...@international-characters.com>.
Hi, Adrian.

You can scp it to us.  I'll send the pw separately.

scp "somefile" oslguest@cs-osl-04.cs.surrey.sfu.ca:/home/oslguest/


On Mon, Nov 11, 2013 at 2:55 PM, Adrian Mcmenamin <ac...@york.ac.uk> wrote:

> On 11 November 2013 20:45, Rob Cameron <robc@international-characters.com
> >wrote:
>
> > Hi, Adrian.
> >
> > We would be interested in looking into this issue with both Xerces-3.1.1
> > and icXML.    (icXML is a high-performance version of Xerces, accelerated
> > with Parabix technology).     We're quite interested in addressing issues
> > for large XML documents.
> >
> > If you can make your data available to us, we can attempt to duplicate
> > the bug and also give you performance reports.
> >
> > We're presently working toward icXML-1.0, but you can check out
> > icXML-0.9.
> >
> > svn co http://parabix.costar.sfu.ca/svn/icXML/icXML-0.9
> >
> > Build with
> > cd icXML-0.9
> > ./configure
> > make
> >
> >
> >
> I am happy to make the data available if at all practical! I have a 5GB or
> so bz2 compressed version of it somewhere (the uncompressed data is about
> 225GB) - how would you suggest I did this? If necessary I suppose this
> could be via post if all else fails.
>
> The code that is having trouble is available at
> https://github.com/mcmenaminadrian/jalan - as you will see it doesn't do
> much at the moment - as I am really trying to gauge if my approach is
> practical.
>
> Thanks for the tip about icXML
>
> Adrian
>

Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
On 11 November 2013 20:45, Rob Cameron <ro...@international-characters.com>wrote:

> Hi, Adrian.
>
> We would be interested in looking into this issue with both Xerces-3.1.1
> and icXML.    (icXML is a high-performance version of Xerces, accelerated
> with Parabix technology).     We're quite interested in addressing issues
> for large XML documents.
>
> If you can make your data available to us, we can attempt to duplicate
> the bug and also give you performance reports.
>
> We're presently working toward icXML-1.0, but you can check out icXML-0.9.
>
> svn co http://parabix.costar.sfu.ca/svn/icXML/icXML-0.9
>
> Build with
> cd icXML-0.9
> ./configure
> make
>
>
>
I am happy to make the data available if at all practical! I have a 5GB or
so bz2 compressed version of it somewhere (the uncompressed data is about
225GB) - how would you suggest I did this? If necessary I suppose this
could be via post if all else fails.

The code that is having trouble is available at
https://github.com/mcmenaminadrian/jalan - as you will see it doesn't do
much at the moment - as I am really trying to gauge if my approach is
practical.

Thanks for the tip about icXML

Adrian

Re: SAX2 XML file size limit?

Posted by Rob Cameron <ro...@international-characters.com>.
Hi, Adrian.

We would be interested in looking into this issue with both Xerces-3.1.1
and icXML. (icXML is a high-performance version of Xerces, accelerated
with Parabix technology). We're quite interested in addressing issues
for large XML documents.

If you can make your data available to us, we can attempt to duplicate
the bug and also give you performance reports.

We're presently working toward icXML-1.0, but you can check out icXML-0.9.

svn co http://parabix.costar.sfu.ca/svn/icXML/icXML-0.9

Build with
cd icXML-0.9
./configure
make


On Sun, Nov 10, 2013 at 11:10 AM, Adrian Mcmenamin <ac...@york.ac.uk>wrote:

> On 10 November 2013 15:56, Alberto Massari <albertomassari@tiscali.it
> >wrote:
>
> > Which version are you using on which operating system? Also, I guess you
> > compiled Xerces on a 64 bit system.
> >
> >
> This is xerces-c-3.1.1 (downloaded and built in the last few days), running
> on a Linux amd64 3.9.3 kernel
>
> Yes, it's all 64 bit
>

Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
On 10 November 2013 15:56, Alberto Massari <al...@tiscali.it>wrote:

> Which version are you using on which operating system? Also, I guess you
> compiled Xerces on a 64 bit system.
>
>
This is xerces-c-3.1.1 (downloaded and built in the last few days), running
on a Linux amd64 3.9.3 kernel

Yes, it's all 64 bit

Re: SAX2 XML file size limit?

Posted by Alberto Massari <al...@tiscali.it>.
Which version are you using on which operating system? Also, I guess you 
compiled Xerces on a 64 bit system.

Alberto

Il 10/11/13 10:36, Adrian Mcmenamin ha scritto:
> That was obviously meant to state 2^32, sorry
>
>
> On 10 November 2013 09:34, Adrian Mcmenamin <ac...@york.ac.uk> wrote:
>
>> I am parsing a very large file and the parsing seems to fail on a (bogus)
>> fatal error once I get over about 2^16 lines (on line 4,295,025,275 to be
>> precise). Is there a hard limit on file sizes that can be parsed?
>>


Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
On 13 November 2013 08:16, Adrian Mcmenamin <ac...@york.ac.uk> wrote:

>
>
>
> On 10 November 2013 09:36, Adrian Mcmenamin <ac...@york.ac.uk> wrote:
>
>> That was obviously meant to state 2^32, sorry
>>
>>
>> On 10 November 2013 09:34, Adrian Mcmenamin <ac...@york.ac.uk> wrote:
>>
>>> I am parsing a very large file and the parsing seems to fail on a
>>> (bogus) fatal error once I get over about 2^16 lines (on line 4,295,025,275
>>> to be precise). Is there a hard limit on file sizes that can be parsed?
>>>
>>
>>
> I deleted approximately 1 billion lines from the file, while ensuring it
> was still well-formed XML, and this time the parse failed on line
> 4,295,015,171 - so I am very confident there is some sort of overflow bug
> in xerces-c's handling of very large XML files.
>


For what it's worth, the parse fails in exactly the same way (in C++) when
the default handler is used, i.e. when nothing at all is being done, but the
Java parser can happily handle the whole file. So all the evidence points
to a bug somewhere in the C++ implementation.
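
Since the failure shows up even with the default handler and seems tied to
the line count rather than to the content, it may be reproducible with a
synthetic file instead of the 225GB original. The following is only a sketch
of that idea: the function names and the <root>/<r/> element names are
hypothetical, and whether such a file triggers the same fatal error has not
been tested here.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes a well-formed XML document with one empty
// element per line, so the line count can be pushed past 2^32 cheaply.
void writeSyntheticXml(std::ostream& out, std::uint64_t records) {
    out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    out << "<root>\n";
    for (std::uint64_t i = 0; i < records; ++i) {
        out << "<r/>\n";  // 5 bytes per record line
    }
    out << "</root>\n";
}

// Total number of lines produced: the XML declaration, the opening tag,
// one line per record, and the closing tag.
std::uint64_t syntheticLineCount(std::uint64_t records) {
    return records + 3;
}

// Convenience wrapper for inspecting a small sample in memory.
// Not intended for billions of records.
std::string smallSample(std::uint64_t records) {
    std::ostringstream buffer;
    writeSyntheticXml(buffer, records);
    assert(buffer.good());
    return buffer.str();
}
```

To get past 2^32 lines, writeSyntheticXml would be pointed at a std::ofstream
with roughly 4.3 billion records; at 5 bytes per record line that is about
21.5GB on disk, which is far easier to share or regenerate than the real data.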

Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
On 10 November 2013 09:36, Adrian Mcmenamin <ac...@york.ac.uk> wrote:

> That was obviously meant to state 2^32, sorry
>
>
> On 10 November 2013 09:34, Adrian Mcmenamin <ac...@york.ac.uk> wrote:
>
>> I am parsing a very large file and the parsing seems to fail on a (bogus)
>> fatal error once I get over about 2^16 lines (on line 4,295,025,275 to be
>> precise). Is there a hard limit on file sizes that can be parsed?
>>
>
>
I deleted approximately 1 billion lines from the file, while ensuring it
was still well-formed XML, and this time the parse failed on line
4,295,015,171 - so I am very confident there is some sort of overflow bug
in xerces-c's handling of very large XML files.
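
The arithmetic behind the overflow suspicion can be checked with a short
sketch. 2^32 is 4,294,967,296, and both reported failure lines sit just above
it. The helper names below are hypothetical, and the 32-bit counter is an
assumption used for illustration only; nothing in this thread confirms which
variable in xerces-c (if any) actually wraps.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>

// 2^32, the first value that no longer fits in a 32-bit unsigned integer.
constexpr std::uint64_t kTwoPow32 = 4294967296ULL;

// Value that a hypothetical 32-bit unsigned counter would hold after being
// incremented `count` times (the conversion reduces modulo 2^32).
std::uint32_t wrap32(std::uint64_t count) {
    return static_cast<std::uint32_t>(count);
}

// How many lines beyond 2^32 a reported line number is.
std::uint64_t linesPastTwoPow32(std::uint64_t line) {
    assert(line >= kTwoPow32);
    return line - kTwoPow32;
}

// Prints both views of a reported failure line.
void report(std::ostream& out, std::uint64_t line) {
    out << line << " is " << linesPastTwoPow32(line)
        << " lines past 2^32; a 32-bit counter would read "
        << wrap32(line) << "\n";
}
```

Calling report(std::cout, 4295025275ULL) (with <iostream> included) gives an
offset of 57979 for the first run, and 4295015171ULL gives 47875 for the run
after the deletion. Both failures fall within about 58,000 lines of 2^32 even
though roughly a billion lines were removed in between, which is what makes a
line-count-related overflow look plausible.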

Re: SAX2 XML file size limit?

Posted by Adrian Mcmenamin <ac...@york.ac.uk>.
That was obviously meant to state 2^32, sorry


On 10 November 2013 09:34, Adrian Mcmenamin <ac...@york.ac.uk> wrote:

> I am parsing a very large file and the parsing seems to fail on a (bogus)
> fatal error once I get over about 2^16 lines (on line 4,295,025,275 to be
> precise). Is there a hard limit on file sizes that can be parsed?
>