Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/11/09 19:35:51 UTC

XML parsing hang

Hi all,

Just a heads-up that we tracked down a serious issue we hit while
parsing about 100M docs.

A handful of these documents caused Tika's parsing to hang. We've got  
a FutureTask that we use to detect and (try to) terminate hung parses.
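For readers unfamiliar with the pattern, here is a minimal sketch of a FutureTask-based timeout. It is not the actual Bixo code; the class name TimedParse, the method runWithTimeout, and the thread name are hypothetical and exist only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or Bixo.
public class TimedParse {

    /**
     * Runs the task on its own thread and waits at most timeoutMs for a result.
     * Returns null if the task timed out, failed, or the caller was interrupted.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMs) {
        FutureTask<T> future = new FutureTask<T>(task);
        Thread worker = new Thread(future, "parse-worker");
        worker.setDaemon(true); // a stuck worker should not keep the JVM alive
        worker.start();
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread. Code that never
            // checks the interrupt flag (for example a tight scanning loop)
            // keeps running and keeps using CPU.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        }
    }
}
```

The caller gets control back after the timeout, but the worker thread only stops if the code it is running responds to interruption.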

But for some of these parse attempts, we'd get lingering threads that
kept burning CPU. Their stack traces all look like this:

> "Thread-14524974" prio=10 tid=0x00002aab18650800 nid=0x6c36 runnable [0x0000000043c2a000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.xerces.impl.XMLScanner.scanExternalID(Unknown Source)
> 	at org.apache.xerces.impl.XMLDocumentScannerImpl.scanDoctypeDecl(Unknown Source)
> 	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
> 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> 	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> 	at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> 	at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> 	at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:47)
> 	at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:236)
> 	at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:536)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:128)
> 	at bixo.parser.TikaCallable.call(TikaCallable.java:62)
> 	at bixo.parser.TikaCallable.call(TikaCallable.java:23)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.lang.Thread.run(Thread.java:619)

We traced it to a bug in XercesImpl (we're using 2.9.1), where the
scanExternalID() method can loop forever if it reaches the end of the
document without finding the matching quote character for a literal.
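As a rough way to check whether a given parser setup is affected, one could feed it input that ends inside a quoted DOCTYPE literal and see whether the parse returns. The sketch below is an assumption-laden probe, not a confirmed reproducer: the sample input is a guess at the kind of document described above, and the class name DoctypeProbe and method probe are hypothetical. It uses whatever JAXP SAX parser is on the classpath and runs the parse on a daemon thread so the probe itself cannot hang.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe, not part of Tika or Xerces.
public class DoctypeProbe {

    /** Parses xml on a daemon thread; returns "PARSED", "REJECTED", or "HUNG". */
    public static String probe(final String xml, long timeoutMs) {
        // Stays "HUNG" unless the worker finishes before the timeout.
        final AtomicReference<String> outcome = new AtomicReference<String>("HUNG");
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
                    SAXParserFactory.newInstance().newSAXParser()
                            .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
                    outcome.set("PARSED");
                } catch (Exception e) {
                    outcome.set("REJECTED"); // parser reported an error and returned
                }
            }
        });
        worker.setDaemon(true); // a looping parse must not block JVM exit
        worker.start();
        try {
            worker.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        // Assumed sample: the document ends before the closing quote of the public ID.
        String truncated = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML";
        System.out.println(probe(truncated, 5000));
    }
}
```

A result of "HUNG" would suggest the parser in use never returned within the timeout on this input; "REJECTED" means it reported an error and returned normally.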

Unfortunately we don't know the exact documents that caused these  
problems, but my guess is that they're not really XML docs, or they're  
badly broken.

This looks like it's been fixed in Xerces 2.10.0, potentially as a  
side effect of the fix for https://issues.apache.org/jira/browse/XERCESJ-1357

-- Ken

PS - Unfortunately 2.10.0 hasn't been pushed to Maven central yet.

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
