Posted to j-dev@xerces.apache.org by "Deepak Kumar (JIRA)" <xe...@xml.apache.org> on 2014/02/04 04:02:08 UTC

[jira] [Comment Edited] (XERCESJ-1276) Improve performance of XML Schema Identity-constraint validation --- XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.

    [ https://issues.apache.org/jira/browse/XERCESJ-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888473#comment-13888473 ] 

Deepak Kumar edited comment on XERCESJ-1276 at 2/4/14 3:00 AM:
---------------------------------------------------------------

I am experiencing a very similar problem, but with a significantly larger impact. Attached is a zip holding the binaries built with the patch pulled in (which makes effective use of hashCode() and equals()). Below is the manifest snippet from the binary:

Manifest-Version: 1.0
Ant-Version: Apache Ant(TM) version 1.8.3 compiled on February 26 2012
Created-By: 1.6.0_32-ea (Sun Microsystems Inc.)

Problem details:
----------------

I have a compressed input file of roughly 25 MB (24.4 MiB) holding XML. Compression is done with the java.util.zip compression/decompression APIs using the default strategy, and I am sure the file could come close to 500 MB when inflated.

A simple piece of code is deployed as a webapp in Tomcat (6.14 through 7.0.50, with Java 1.6.30 and Java 1.6.32). It reads in the compressed file and runs an XML parser on it, and the parse takes nearly 30 minutes to complete on a laptop with a 4-core i5 2.5 GHz processor (nothing in this process is parallelized for optimization). I have checked and confirmed this both when explicitly supplying the Xerces binaries (2.6 and 2.11), so that Xerces controls the entire parse, and with the JDK's default parser implementation, which is very much the same as Xerces.
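
For anyone who wants to reproduce this kind of setup on a small scale, here is a minimal sketch of reading compressed XML and validating it against a schema that carries an identity constraint. It is not my actual webapp code. The class name, the tiny schema, and the choice of GZIP (rather than some other java.util.zip format) are all assumptions made only for illustration. It uses the standard JAXP javax.xml.validation API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical reproduction sketch; names and schema are illustrative only.
class CompressedXmlValidationSketch {
    // Assumed schema: <items> holds <item id="..."/> elements, and xs:key
    // requires every id to be unique (an identity constraint).
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='items'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name='item' maxOccurs='unbounded'>"
        + "<xs:complexType>"
        + "<xs:attribute name='id' type='xs:string' use='required'/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:sequence></xs:complexType>"
        + "<xs:key name='itemKey'>"
        + "<xs:selector xpath='item'/><xs:field xpath='@id'/>"
        + "</xs:key>"
        + "</xs:element>"
        + "</xs:schema>";

    // Builds a document with n unique ids; optionally appends a duplicate id.
    static String buildXml(int n, boolean withDuplicate) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < n; i++) {
            sb.append("<item id='").append(i).append("'/>");
        }
        if (withDuplicate) {
            sb.append("<item id='0'/>");
        }
        return sb.append("</items>").toString();
    }

    // Compresses the XML text with GZIP (assumption: the real file may use
    // a different java.util.zip format).
    static byte[] gzip(String xml) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(xml.getBytes("UTF-8"));
        gz.close();
        return bos.toByteArray();
    }

    // Streams the compressed bytes through the schema validator.
    // With no ErrorHandler set, validation errors are thrown as SAXException.
    static boolean isValid(byte[] compressed) throws IOException, SAXException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
        try {
            validator.validate(new StreamSource(in));
            return true;
        } catch (SAXException e) {
            return false;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 50000;
        byte[] data = gzip(buildXml(n, false));
        long start = System.nanoTime();
        boolean valid = isValid(data);
        long millis = (System.nanoTime() - start) / 1000000L;
        System.out.println("items=" + n + " valid=" + valid + " ms=" + millis);
    }
}
```

Increasing the item count passed on the command line should show how validation time grows with the number of keyed values on a given parser build. I have not included timings here because they depend on the JDK and Xerces version in use.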

Across multiple runs, several profiling tools identified the code below in Xerces as the potential hotspot. It chokes up entirely because of somewhat bad nested looping over significantly large value indexes (potentially in the MBs), which also lines up with the comment in the source:

org.apache.xerces.internal.impl.xs.XMLSchemaValidator#ValueStoreBase.contains()
            // REVISIT: we can improve performance by using hash codes, instead of
            // traversing global vector that could be quite large.
            ..........
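
To make the cost difference concrete, here is a minimal sketch contrasting a linear scan over stored value tuples with a hash-based lookup. It is not the Xerces ValueStoreBase source and not the attached patch. The class and method names are hypothetical, and the tuples are simplified to lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT the Xerces code and NOT the patch.
class ValueStoreSketch {
    // Each tuple holds the field values of one identity-constraint entry.
    private final List<List<String>> tuples = new ArrayList<List<String>>();
    // Hash index over the same tuples; List equals()/hashCode() are value-based.
    private final Set<List<String>> index = new HashSet<List<String>>();

    void add(String... fields) {
        List<String> tuple = Arrays.asList(fields.clone());
        tuples.add(tuple);
        index.add(tuple);
    }

    // Linear scan: O(n) per lookup, so O(n^2) overall when every newly
    // seen tuple is checked against all previously stored ones.
    boolean containsLinear(String... fields) {
        List<String> probe = Arrays.asList(fields);
        for (List<String> tuple : tuples) {
            if (tuple.equals(probe)) {
                return true;
            }
        }
        return false;
    }

    // Hash lookup: expected O(1) per lookup.
    boolean containsHashed(String... fields) {
        return index.contains(Arrays.asList(fields));
    }

    public static void main(String[] args) {
        int n = 20000;
        ValueStoreSketch store = new ValueStoreSketch();
        for (int i = 0; i < n; i++) {
            store.add("id-" + i, "type-" + (i % 7));
        }

        long t0 = System.nanoTime();
        boolean allLinear = true;
        for (int i = 0; i < n; i++) {
            allLinear &= store.containsLinear("id-" + i, "type-" + (i % 7));
        }
        long t1 = System.nanoTime();
        boolean allHashed = true;
        for (int i = 0; i < n; i++) {
            allHashed &= store.containsHashed("id-" + i, "type-" + (i % 7));
        }
        long t2 = System.nanoTime();

        System.out.println("linear found all: " + allLinear
            + ", ms=" + (t1 - t0) / 1000000L);
        System.out.println("hashed found all: " + allHashed
            + ", ms=" + (t2 - t1) / 1000000L);
    }
}
```

The point of the sketch is only the shape of the two lookups: the linear version repeats a full traversal for every value, which is the pattern the REVISIT comment warns about, while the hashed version relies on equals() and hashCode() of the stored tuple.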


[NOTE] Interestingly, the same piece of code completes within a minute (with both the JDK and the Xerces implementation) when run from Eclipse, and even with a plain "java -classpath ... ParserTest", without any significant hotspot showing up in the JVM. This makes me worry whether Tomcat is doing something internally during the parse.

As of now, I am able to run it within a minute inside Tomcat as well. The attached binary pair can be used as a drop-in replacement by anyone facing this problem.

[ATTENTION] On a different note, with the existing Xerces binaries, if the application attempts to re-process the XML files, even in a different thread, it severely impacts the execution of the other operational threads. The entire webapp appears to freeze randomly, and, strangely, the parse takes even longer (close to 2x the time), even with enough memory allocated. I am not sure whether the issue also occurs on other application servers such as GlassFish or Jetty, or whether it is specific to Tomcat.

--Deepak


> Improve performance of XML Schema Identity-constraint validation --- XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: XERCESJ-1276
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1276
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: XML Schema 1.0 Structures
>    Affects Versions: 2.6.2, 2.9.1
>            Reporter: Kenny MacLeod
>              Labels: gsoc, gsoc2013, mentor
>         Attachments: XMLSchemaValidator.java, Xerces-J-src.2.11.0_patch1276.txt, xerces-binaries-patched-over-2.11.0.zip, xerces-value-store.txt
>
>
> Under certain conditions, the contains() method in XMLSchemaValidator$ValueStoreBase can cripple the performance of parsing and validation.
> I'm not sure what those conditions are, but as a guideline figure I was using JAXB2 to deserialize a 22meg XML file.  Without schema validation, it took 5 seconds.  With validation, it took over 3 minutes (JDK 1.5.0_10 on win32). My profiler pointed the finger squarely at that method XMLSchemaValidator.
> Suspicions were aroused further when seeing this comment in the source:
> public boolean contains() {
>             // REVISIT: we can improve performance by using hash codes, instead of
>             // traversing global vector that could be quite large.
> This is present in Xerces 2.6.2 contained with JDK1.5.0_10, and also in the source for 2.9.1.



