Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2018/05/17 15:51:05 UTC

Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)

Hi,

I have two questions regarding thread safety and locking in Tika's MIME type
detectors. They came up while investigating global locks in NUTCH-2578
(multi-threaded fetcher) [1].

First, are the methods Tika.detect(...) and MimeTypes.detect(...) thread-safe?
I found an answer from 2011 about Tika.detect(...)
   https://www.mail-archive.com/user@tika.apache.org/msg00296.html
but I would like to confirm that it still holds and that it also applies to
MimeTypes.detect(...).


Second, there is a lock (on the jar file) when detecting the MIME type
of XML or HTML documents:

 "FetcherThread" #146 daemon ... waiting for monitor entry [0x00007f21b3f45000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
        - waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        ...
        at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
        at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
        at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
        at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
        at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
        at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
        at org.apache.nutch.protocol.Content.<init>(Content.java:107)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)

Out of 120 threads, I found up to 30 waiting for this lock.
For the stack line
  o.a.xerces.parsers.ObjectFactory.createObject(...)
I found the following discussion
  https://www.mail-archive.com/j-users@xerces.apache.org/msg03825.html
which recommends either reusing the parser (probably hard to make thread-safe)
or explicitly setting the property "org.apache.xerces.xni.parser.XMLParserConfiguration".
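
For illustration, the second option could look roughly like the sketch below.
It only sets the JVM system property named above, once at startup. The class
name XercesConfigPin is hypothetical, and the configuration class used as the
value is an assumption (I believe it is the stock configuration in Xerces-J
2.x); it should be checked against the Xerces version actually on the classpath.

```java
// Sketch only: pins the Xerces parser configuration through a JVM system
// property, which (per the linked discussion) should spare Xerces the lookup
// of the service provider file in the jar each time a parser is created.
final class XercesConfigPin {

    // Property name quoted in the message above.
    static final String PROPERTY =
        "org.apache.xerces.xni.parser.XMLParserConfiguration";

    // Assumption: the stock configuration class shipped with Xerces-J 2.x.
    // Verify this against the Xerces jar you actually deploy.
    static final String ASSUMED_CONFIG =
        "org.apache.xerces.parsers.XIncludeAwareParserConfiguration";

    /** Sets the property unless it is already set (e.g. with -D on the command line). */
    static String pin() {
        String current = System.getProperty(PROPERTY);
        if (current == null) {
            System.setProperty(PROPERTY, ASSUMED_CONFIG);
            current = ASSUMED_CONFIG;
        }
        return current;
    }

    public static void main(String[] args) {
        // Call once at startup, before any worker thread creates a parser.
        System.out.println(PROPERTY + "=" + pin());
    }
}
```

The same effect should be achievable without code by passing the property
with -D on the java command line; pin() deliberately leaves such a value alone.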

Has anyone seen a similar problem?


Thanks,
Sebastian


[1] https://issues.apache.org/jira/browse/NUTCH-2578

Re: Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Based on the Xerces discussion it sounds like using a pool of parsers
would be the best approach.
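
A minimal sketch of what such a pool might look like, using only the JAXP
classes from the JDK. SaxParserPool and its methods are hypothetical names and
not part of Tika; the sketch assumes the underlying parser implementation
supports SAXParser.reset().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: a fixed-size pool of SAXParser instances.
final class SaxParserPool {

    private final BlockingQueue<SAXParser> parsers;

    SaxParserPool(int size) throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        parsers = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            // The costly parser construction happens here, once per pooled parser.
            parsers.add(factory.newSAXParser());
        }
    }

    /** Blocks until a parser is free; the caller owns it exclusively until release(). */
    SAXParser acquire() throws InterruptedException {
        return parsers.take();
    }

    /** Resets the parser to its original configuration and returns it to the pool. */
    void release(SAXParser parser) {
        parser.reset(); // throws UnsupportedOperationException on implementations without reset support
        parsers.offer(parser);
    }

    /** Number of parsers currently idle in the pool. */
    int available() {
        return parsers.size();
    }

    public static void main(String[] args) throws Exception {
        SaxParserPool pool = new SaxParserPool(4);
        SAXParser parser = pool.acquire();
        try {
            byte[] xml = "<root/>".getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    System.out.println("root element: " + qName);
                }
            });
        } finally {
            pool.release(parser); // always hand the parser back, even if parsing fails
        }
    }
}
```

Each parser is used by one thread at a time, so the parsers themselves do not
need to be thread-safe; only the queue is shared between threads.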

Best,

Jukka
