Posted to c-dev@xerces.apache.org by "Larry West (JIRA)" <xe...@xml.apache.org> on 2006/01/24 03:50:40 UTC

[jira] Updated: (XERCESC-1556) Severe performance problem validating with large schema

     [ http://issues.apache.org/jira/browse/XERCESC-1556?page=all ]

Larry West updated XERCESC-1556:
--------------------------------

    Attachment: xerces-test-pseudocode.txt
                xerces-c-gprof-analysis.txt
                xerces-c-gprof-analysis.txt

These are the attachments mentioned in the main "issue" text: pseudo-code, gprof output, and my analysis of the gprof output (and my own instrumentation).

> Severe performance problem validating with large schema
> -------------------------------------------------------
>
>          Key: XERCESC-1556
>          URL: http://issues.apache.org/jira/browse/XERCESC-1556
>      Project: Xerces-C++
>         Type: Bug
>   Components: Validating Parser (Schema) (Xerces 1.5 or up only)
>     Versions: 2.4.0, 2.7.0
>  Environment: HP-UX 11.11 on PA-RISC (HP 9000/800); C++ compiler is aCC vA.03.45.
>     Reporter: Larry West
>  Attachments: xerces-c-gprof-analysis.txt, xerces-c-gprof-analysis.txt, xerces-test-pseudocode.txt
>
> (I will try to attach a separate file with the C++ application pseudo-code that experiences the performance problem: xerces-test-pseudocode.txt.)
> The problem was observed against both the 2.4.0 and 2.7.0 versions of Xerces-C, running in a single-threaded application on an unloaded server.
> The schema we are validating against is huge but publicly available, so I'll just provide a URL.  There are actually several very similar versions of it, with names such as "2004v3.0" and "2005v1.2".  The 2005v2.0 version contains about 536 files totalling about 4.75MB, though I don't know how much of that is actively in use (a lot of it is, though).  All recent versions show the same performance problems.
> A general page is at: http://www.irs.gov/efile/article/0,,id=128360,00.html
> The schema giving us problems is contained in the Zip file efile1120x_2005v2.0.zip, URL=
> 	http://www.irs.gov/pub/irs-schema/efile1120x_2005v2.0.zip
> When you expand this, the directory structure is, of course, important.  The "2005v2.0" directory tree contains the top-level schema in question (for the 1120 business returns) at:
> 	2005v2.0/CorporateIncomeTax/Corp1120/Return1120.xsd
> The data files (business income tax returns) that are validated against this can be over a megabyte in size, though I don't know how much that affects the time to validate (that is, I assume the time does depend on the size, but I haven't measured the relation between the two).
> The problem:
> The problem is that it takes 2-4 hours to validate against this schema on a fairly high-performing platform.  For comparison, using Xerces-J v2.7.1 to do the same validation normally takes under a minute (though it uses four times the memory).
> I believe I have identified the areas causing the problem: repeated sequential lookups through lists that have 2000+ elements.  My testing shows that, in most cases, none of these lookups ever finds a match.  I was planning on introducing a hash map to cache the results of the first lookup, but using Xerces-J turned out to be a more practical approach in my case.
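
The caching idea in the paragraph above could look roughly like the following. This is only a minimal sketch under stated assumptions, not Xerces-C code and not the reporter's patch: `Component` and `CachedLookup` are hypothetical names invented for illustration, and the sketch assumes lookups are keyed by a single name string. Because most of the observed lookups never match, it caches misses as well as hits.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a top-level schema component; not a Xerces-C type.
struct Component {
    std::string name;
};

// Hypothetical wrapper that memoizes a slow sequential search.
// Misses are cached too, so a repeated failed lookup costs one hash probe.
class CachedLookup {
public:
    // The vector must outlive this object and must not be modified while
    // cached pointers into it are in use.
    explicit CachedLookup(const std::vector<Component>& items) : fItems(items) {}

    // Returns a pointer to the matching component, or nullptr if none exists.
    const Component* find(const std::string& name) {
        auto hit = fCache.find(name);
        if (hit != fCache.end())
            return hit->second;      // cached hit, or cached miss (nullptr)

        ++fScans;                    // falling back to the sequential scan
        const Component* found = nullptr;
        for (const Component& c : fItems) {
            if (c.name == name) {
                found = &c;
                break;
            }
        }
        fCache.emplace(name, found); // remember misses as well as hits
        return found;
    }

    // Number of sequential scans actually performed so far.
    std::size_t scans() const { return fScans; }

private:
    const std::vector<Component>& fItems;
    std::unordered_map<std::string, const Component*> fCache;
    std::size_t fScans = 0;
};
```

With a cache like this, the sequential scan runs once per distinct name; every later lookup for that name, successful or not, is answered from the map. Whether this fits the real code depends on details not covered here, such as how the lists change while the schema is being loaded.
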
> So, what follows are my notes from the debugging and performance instrumentation I've done.
> The apparent key point: the "higher-level" (4-param) SchemaInfo::getTopLevelComponent() is called 4920 times, but it calls the "lower-level" (3-param) one 1.78M times.  Here is pseudo-code for the 4-param version:
>     //== get here 4920 times
>     DOMElement* child = getTopLevelComponent(compCategory, compName, name);
>     if (child == 0)
>     {   //== get here 4159 times
>         listSize = fIncludeInfoList->size();
>         //== listSize always 427 --> number of include files
>         for (i = 0; i < listSize; ++i) {
>             SchemaInfo* ptr = fIncludeInfoList->elementAt(i);
>             child = ptr->getTopLevelComponent(compCategory, compName, name);
>             //== the above NEVER succeeds.  It's called 4159*427 (1.78M) times.
>         }
>     }
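
The call counts quoted in the pseudo-code can be checked with a little arithmetic. The helper below is a hypothetical illustration, not part of Xerces-C; it assumes, as the instrumentation comments state, that every local miss scans the entire include list and never finds a match.

```cpp
#include <cstdint>

// Hypothetical helper (not Xerces-C code): total number of 3-parameter
// lookups when `topLevelCalls` 4-parameter calls each perform one local
// lookup, and `localMisses` of them then scan all `includeCount` included
// schemas without ever finding a match.
constexpr std::uint64_t lowLevelLookups(std::uint64_t topLevelCalls,
                                        std::uint64_t localMisses,
                                        std::uint64_t includeCount) {
    return topLevelCalls + localMisses * includeCount;
}

// Figures from the report: 4920 calls, 4159 local misses, 427 include files.
static_assert(lowLevelLookups(4920, 4159, 427) == 1780813,
              "4920 + 4159 * 427 lookups, roughly 1.78M");
```

The include-list scans alone account for 4159 * 427 = 1,775,893 of those calls, which matches the 1.78M figure reported above.
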
> Part of my investigation involved using gprof; I will try to attach my conclusions from that as a separate attachment ("xerces-c-gprof-analysis.txt"), and the gprof output (which is large, hence zipped) as a 2nd attachment, "xerces-c-gprof-out.zip".
> Other notes:
> From casual observation, it appears that very little of the time is spent doing I/O.  It appears that the schema (all of its files) is read in once.  I'm not sure, though, whether that happens very quickly at the beginning or whether it's spread out over the 2-hour run.
> Also, the memory usage rises to about 64MB reasonably early in the process (a matter of minutes), then stays flat, which also suggests to me that it has finished parsing the schema files early on.  [As I stated earlier, Xerces-J takes under a minute to do this.  Its memory usage grows to ~256MB early on and stays flat after that.]
> If a sample data file is needed for investigation, let me know and I'll get one.
> Larry West
> Intuit, Inc
> Consumer Tax Group
