Posted to dev@tika.apache.org by "Adam Rauch (JIRA)" <ji...@apache.org> on 2010/01/27 00:59:37 UTC

[jira] Created: (TIKA-374) AutoDetectParser not thread-safe?

AutoDetectParser not thread-safe?
---------------------------------

                 Key: TIKA-374
                 URL: https://issues.apache.org/jira/browse/TIKA-374
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 0.5
         Environment: Dell E6400 (dual-core) running 64-bit Windows 7.  Also reproduced on an 8-processor Mac OS X server.
            Reporter: Adam Rauch


We are using Tika 0.5 to parse files that are added to a Lucene index.  When we assign multiple threads to the parsing task, the AutoDetectParser.parse() method occasionally fails to return.  In our case, it appears that a HashMap inside Xerces gets corrupted, causing an infinite loop inside HashMap.get().  This seems to be a concurrency problem; we have not seen the issue when running single-threaded.

Other posts have stated that AutoDetectParser is thread-safe.  A quick look at the source code shows that an AutoDetectParser holds a MimeTypes, which holds an XmlRootExtractor, which holds a SAXParser.  As a result, a single SAXParser instance can end up parsing documents in multiple threads simultaneously.  The Java 1.4 SAXParser JavaDoc clearly states that "An implementation of SAXParser is NOT guaranteed to behave as per the specification if it is used concurrently by two or more threads."  More recent versions of the JavaDoc have removed the warning, though the presence of "setProperty()" certainly means that a SAXParser is not immutable.  As you can see from the stack trace below, properties seem to be the issue in this case.

We've tried to work around the issue by constructing a new AutoDetectParser for each file we parse, but this doesn't solve the problem.  Multiple AutoDetectParsers can still end up sharing a single instance of MimeTypes, because TikaConfig holds a MimeTypes instance statically (??) and updates it without synchronization (??).
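
To make the ownership chain described above concrete, here is a minimal model of it in Java.  It is not Tika source code: the class names (DetectingParser, TypeRegistry, RootExtractor, Config) are hypothetical stand-ins for AutoDetectParser, MimeTypes, XmlRootExtractor and TikaConfig, and the static sharing is assumed to work the way the report suspects.  It only shows why constructing a new outer parser per file would not give each thread its own SAXParser.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

class SharedParserModel {

    // Hypothetical stand-in for XmlRootExtractor: owns one SAXParser for its lifetime.
    static final class RootExtractor {
        final SAXParser parser;

        RootExtractor() {
            try {
                parser = SAXParserFactory.newInstance().newSAXParser();
            } catch (Exception e) {
                throw new IllegalStateException("no SAX parser available", e);
            }
        }
    }

    // Hypothetical stand-in for MimeTypes: owns one RootExtractor.
    static final class TypeRegistry {
        final RootExtractor extractor = new RootExtractor();
    }

    // Hypothetical stand-in for TikaConfig: hands out one statically held registry.
    static final class Config {
        private static TypeRegistry shared;

        static synchronized TypeRegistry registry() {
            if (shared == null) {
                shared = new TypeRegistry();
            }
            return shared;
        }
    }

    // Hypothetical stand-in for AutoDetectParser: a new instance still reuses the shared registry.
    static final class DetectingParser {
        final TypeRegistry registry = Config.registry();

        SAXParser underlyingSaxParser() {
            return registry.extractor.parser;
        }
    }

    public static void main(String[] args) {
        DetectingParser a = new DetectingParser();
        DetectingParser b = new DetectingParser();
        // Two distinct outer parsers, one shared SAXParser underneath.
        System.out.println(a != b);
        System.out.println(a.underlyingSaxParser() == b.underlyingSaxParser());
    }
}
```

In this model both comparisons are true, so two threads that each create their own DetectingParser would still drive the same SAXParser.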

java.lang.Thread.State: RUNNABLE
             at java.util.HashMap.get(HashMap.java:303)
             at org.apache.xerces.util.ParserConfigurationSettings.getProperty(ParserConfigurationSettings.java:224)
             at org.apache.xerces.impl.dtd.XMLDTDProcessor.reset(XMLDTDProcessor.java:344)
             at org.apache.xerces.parsers.XML11Configuration.reset(XML11Configuration.java:984)
             at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:806)
             at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
             at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
             at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
             at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
             at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
             at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
             at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:63)
             at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:237)
             at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:534)
             at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:92)
             at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
             at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:170)
             at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:664)
             at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:737)
             at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:773)
             at java.lang.Thread.run(Thread.java:637)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-374) AutoDetectParser not thread-safe?

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-374.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

Thanks for the accurate analysis of the problem. I fixed this in revision 903775 by making each call to XmlRootExtractor.extractRootElement() use a new SAXParser instance.
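
For readers who want to see the shape of such a fix, the following is a rough sketch of root-element extraction that creates a new SAXParser on every call.  It is not the code from revision 903775; the class name RootElementSketch, the RootHandler helper and the abort-by-exception approach are assumptions made for illustration.  Only standard JAXP and SAX calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RootElementSketch {

    // Records the first element seen, then aborts the parse by throwing.
    private static final class RootHandler extends DefaultHandler {
        QName root;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) throws SAXException {
            root = new QName(uri, localName);
            throw new SAXException("root element seen, aborting parse");
        }
    }

    // Returns the root element name, or null if the input is not parseable XML.
    static QName extractRootElement(InputStream stream) {
        RootHandler handler = new RootHandler();
        try {
            // A fresh factory and parser per call: no parser state is shared between threads.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(stream, handler);
        } catch (Exception e) {
            // Either the deliberate abort above or malformed input; handler.root tells which.
        }
        return handler.root;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<a:root xmlns:a=\"urn:example\"><child/></a:root>"
                .getBytes(StandardCharsets.UTF_8);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<QName>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            results.add(pool.submit(() -> extractRootElement(new ByteArrayInputStream(xml))));
        }
        for (Future<QName> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}
```

Creating a parser per call costs some setup time on each detection, but it removes the shared mutable state entirely, which is simpler to reason about than locking around a single shared instance.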

