Posted to j-dev@xerces.apache.org by "Bene (JIRA)" <xe...@xml.apache.org> on 2009/07/06 15:44:15 UTC

[jira] Updated: (XERCESJ-1382) Performance problem in org.apache.xerces.dom.CharacterDataImpl appendData

     [ https://issues.apache.org/jira/browse/XERCESJ-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bene updated XERCESJ-1382:
--------------------------

    Attachment: xerces_performance_problem.xml
                xerces_performance_problem.png

xerces_performance_problem.xml is the XML document that takes too long to parse / validate; xerces_performance_problem.png shows the JRat performance analysis, which points to org.apache.xerces.dom.CharacterDataImpl.appendData().


> Performance problem in org.apache.xerces.dom.CharacterDataImpl appendData
> -------------------------------------------------------------------------
>
>                 Key: XERCESJ-1382
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1382
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: DOM (Level 3 Core)
>    Affects Versions: 2.6.0, 2.9.1
>         Environment: Windows XP SP2; JRE 1.6.0_13; Xerces2 Java Parser 2.9.1 Release (Xerces-J-bin.2.9.1.zip)
>            Reporter: Bene
>            Priority: Critical
>         Attachments: xerces_performance_problem.png, xerces_performance_problem.xml
>
>
> Parsing a large XML document takes too long if the document contains CDATA sections with embedded XML.
> The problem initially occurred with Xerces 2.6.0, where it took about 30 seconds (!) to parse an XML document of about 250 KB.
> We then upgraded to Xerces 2.9.1, which improved the parse time to about 5 seconds. Unfortunately, this is still much too slow.
> I searched for similar bug reports and found several:
> XERCESJ-102
> XERCESJ-1268
> XALANJ-2398
> Unfortunately the issue is still not fixed, so I decided to create this report.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org


Re: [jira] Updated: (XERCESJ-1382) Performance problem in org.apache.xerces.dom.CharacterDataImpl appendData

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Richard,

I've been thinking that XMLEntityScanner.scanData() has been in need of
some work for quite some time. Not only does it constantly break on
newlines, it also does an unnecessary array copy for CDATA sections. Like
many things I'd like to do, I just haven't had the time, and I need to be
very careful with code changes in there because it's a critical part of
the parser.

Can you attach your analysis to the JIRA issue? This will make it much
easier to find later and may also be the only way the reporter (who might
not be subscribed to j-dev@xerces.apache.org) gets notified of the
progress / status.

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Richard Kelly <ra...@gmail.com> wrote on 07/09/2009 08:52:57 AM:

> I took a look at this today.
>
> My initial results were less than 200 ms for the problem XML with the
> default settings, but over 30 seconds if I set the "cdata-sections"
> parameter to true before parsing.
>
> I first tried replacing the String concatenation in the
> CharacterDataImpl.appendData() method with a StringBuffer.  This
> reduced the parse time to around 10 seconds but that still seemed
> unacceptably slow.
>
> After profiling, I noticed that the problem appears to be that
> appendData() is just being called far too many times due to the number
> of line breaks in the xerces_performance_problem.xml file.
>
> The scanData() method in org.apache.xerces.impl.XMLEntityScanner
> splits character data at every line break rather than filling up the
> whole buffer.  This results in very small chunks of character data and
> means that appendData() gets called too often.
>
> I tried instructing the scanData() method to ignore line breaks in
> character data, by commenting out these four lines:
>
>             else if (c == '\n' || (external && c == '\r')) {
>                 fCurrentEntity.position--;
>                 break;
>             }
>
> This resulted in the full buffer being used and fixed the performance
> problem (the problem XML file parsed in well under 1 second).
> Unfortunately removing this code has the side effect of breaking some
> other things (e.g. the Locator), but perhaps someone more familiar
> with the XMLEntityScanner can suggest a way to fix this.
>
> In the meantime, a temporary workaround would be to remove line breaks
> from the CDATA in your problematic XML files before parsing.

Re: [jira] Updated: (XERCESJ-1382) Performance problem in org.apache.xerces.dom.CharacterDataImpl appendData

Posted by Richard Kelly <ra...@gmail.com>.
I took a look at this today.

My initial results were less than 200 ms for the problem XML with the
default settings, but over 30 seconds if I set the "cdata-sections"
parameter to true before parsing.
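
For anyone who wants to reproduce the two configurations: the message
does not say which API was used to set "cdata-sections", so the following
is only a sketch that assumes the DOM Level 3 Load and Save LSParser
shipped with the JDK. The class and method names (CdataSectionsSketch,
parse) are made up for illustration.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper for illustration; not part of Xerces.
final class CdataSectionsSketch {

    /** Parses xml with the "cdata-sections" DOM configuration parameter set as requested. */
    static Document parse(String xml, boolean keepCdataSections) {
        DOMImplementationLS ls;
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            ls = (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load and Save implementation available", e);
        }
        LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        // true keeps CDATASection nodes in the resulting document
        parser.getDomConfig().setParameter("cdata-sections", Boolean.valueOf(keepCdataSections));
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return parser.parse(input);
    }

    public static void main(String[] args) {
        Document doc = parse("<r><![CDATA[line 1\nline 2]]></r>", true);
        Node first = doc.getDocumentElement().getFirstChild();
        System.out.println(first.getNodeType() == Node.CDATA_SECTION_NODE);
    }
}
```

Timing parse() on the attached document with the flag on and off would be
one way to compare the two cases described above.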

I first tried replacing the String concatenation in the
CharacterDataImpl.appendData() method with a StringBuffer.  This
reduced the parse time to around 10 seconds but that still seemed
unacceptably slow.
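
To illustrate why that change helps at all, here is a minimal sketch of
the two append strategies. It is not the Xerces source; AppendSketch and
its methods are hypothetical stand-ins for a node that accumulates its
character data chunk by chunk.

```java
// Hypothetical stand-in for illustration; not the Xerces implementation.
final class AppendSketch {

    /** Appends by String concatenation: every step copies all data accumulated so far. */
    static String appendWithConcat(String[] chunks) {
        String data = "";
        for (String chunk : chunks) {
            data = data + chunk; // allocates a new String of the full length each time
        }
        return data;
    }

    /** Appends into a StringBuffer: the backing array grows in amortized steps. */
    static String appendWithBuffer(String[] chunks) {
        StringBuffer buffer = new StringBuffer();
        for (String chunk : chunks) {
            buffer.append(chunk); // copies only the new chunk in the common case
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String[] chunks = { "<a>\n", "<b/>\n", "</a>\n" };
        System.out.println(appendWithConcat(chunks).equals(appendWithBuffer(chunks)));
    }
}
```

With n small chunks the concatenation variant copies on the order of n*n
characters in total, while the buffer variant copies roughly n chunks'
worth. The number of calls stays the same in both variants.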

After profiling, I noticed that the problem appears to be that
appendData() is just being called far too many times due to the number
of line breaks in the xerces_performance_problem.xml file.

The scanData() method in org.apache.xerces.impl.XMLEntityScanner
splits character data at every line break rather than filling up the
whole buffer.  This results in very small chunks of character data and
means that appendData() gets called too often.
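
The effect on the number of chunks can be pictured with a small
simulation. This is a sketch only: it does not call or reproduce
scanData(), and the class name, method names and the buffer size used in
main are assumptions chosen for illustration.

```java
// Hypothetical simulation for illustration; not the Xerces scanner.
final class ChunkCountSketch {

    /** Counts chunks when a chunk ends at the first line break or when the buffer is full. */
    static int chunksWhenBreakingOnNewlines(String data, int bufferSize) {
        int chunks = 0;
        int pos = 0;
        while (pos < data.length()) {
            int end = Math.min(pos + bufferSize, data.length());
            int newline = data.indexOf('\n', pos);
            if (newline >= 0 && newline < end) {
                end = newline + 1; // stop right after the line break
            }
            chunks++;
            pos = end;
        }
        return chunks;
    }

    /** Counts chunks when every chunk fills the buffer completely (except possibly the last). */
    static int chunksWhenFillingBuffer(String data, int bufferSize) {
        return (data.length() + bufferSize - 1) / bufferSize;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("<item/>\n"); // many short lines, as in embedded XML
        }
        String data = sb.toString();
        System.out.println(chunksWhenBreakingOnNewlines(data, 2048));
        System.out.println(chunksWhenFillingBuffer(data, 2048));
    }
}
```

Each chunk corresponds to one append call in this model, so data made of
many short lines produces one call per line instead of one per buffer.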

I tried instructing the scanData() method to ignore line breaks in
character data, by commenting out these four lines:

            else if (c == '\n' || (external && c == '\r')) {
                fCurrentEntity.position--;
                break;
            }

This resulted in the full buffer being used and fixed the performance
problem (the problem XML file parsed in well under 1 second).
Unfortunately removing this code has the side effect of breaking some
other things (e.g. the Locator), but perhaps someone more familiar
with the XMLEntityScanner can suggest a way to fix this.
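
As background on why the Locator is affected: it reports line and column
numbers, so something has to notice every line break in the input. The
sketch below only illustrates that bookkeeping on a multi-line chunk. It
is not Xerces code and not a proposed patch; LineTrackingSketch is a
hypothetical class, and it ignores "\r\n" normalization and any other
reason the scanner may have for stopping at a line break.

```java
// Hypothetical illustration of line/column bookkeeping; not Xerces code.
final class LineTrackingSketch {
    private int line = 1;
    private int column = 1;

    /** Advances the position over one chunk, treating '\n' as a line break. */
    void advance(CharSequence chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            if (chunk.charAt(i) == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
    }

    int line() {
        return line;
    }

    int column() {
        return column;
    }

    public static void main(String[] args) {
        LineTrackingSketch tracker = new LineTrackingSketch();
        tracker.advance("ab\ncd\ne");
        System.out.println(tracker.line() + ":" + tracker.column());
    }
}
```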

In the meantime, a temporary workaround would be to remove line breaks
from the CDATA in your problematic XML files before parsing.
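
A minimal sketch of such a preprocessing step follows, with several
assumptions: the whole document fits in a String, a simple regular
expression is good enough to find the CDATA sections (it would also match
CDATA-like text inside comments), and each line break is replaced by a
single space rather than dropped so that adjacent tokens in the embedded
XML do not run together. The class and method names are hypothetical.
Note that this changes the content of the CDATA sections.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessing helper for illustration; not part of Xerces.
final class CdataLineBreakStripper {

    // Non-greedy match of one CDATA section; DOTALL lets '.' span line breaks.
    private static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[(.*?)\\]\\]>", Pattern.DOTALL);

    /** Replaces every line break inside CDATA sections with a single space. */
    static String stripLineBreaksInCdata(String xml) {
        Matcher m = CDATA.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1).replaceAll("\\r\\n|\\r|\\n", " ");
            // quoteReplacement keeps '$' and '\' in the body literal
            m.appendReplacement(out, Matcher.quoteReplacement("<![CDATA[" + body + "]]>"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripLineBreaksInCdata("<r><![CDATA[<a>\n<b/>\n</a>]]></r>"));
    }
}
```

Text outside CDATA sections is left untouched.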

