Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2018/05/17 15:51:05 UTC
Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)
Hi,
I have two questions regarding thread-safety and locking in Tika's MIME type detectors,
which came up while investigating global locks in NUTCH-2578 (multi-threaded fetcher) [1].
First, are the methods Tika.detect(...) and MimeType.detect(...) thread-safe?
I found an answer from 2011 about Tika.detect(...)
https://www.mail-archive.com/user@tika.apache.org/msg00296.html
but I want to make sure that it still holds and that it also applies to
MimeType.detect(...).
Second, there is a lock (on the jar file) when detecting the MIME type
of XML or HTML documents:
"FetcherThread" #146 daemon ... waiting for monitor entry [0x00007f21b3f45000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
        - waiting to lock <0x00000005e03245b8> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        ...
        at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
        at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.parsers.SAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.<init>(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserFactoryImpl.newSAXParser(Unknown Source)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:62)
        at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:42)
        at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
        at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:494)
        at org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:193)
        at org.apache.nutch.protocol.Content.getContentType(Content.java:310)
        at org.apache.nutch.protocol.Content.<init>(Content.java:107)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:321)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
Of 120 threads, I found up to 30 waiting for this lock.
For the stack frame
o.a.xerces.parsers.ObjectFactory.createObject(...)
I found the following discussion
https://www.mail-archive.com/j-users@xerces.apache.org/msg03825.html
which recommends either reusing the parser (probably hard to do in a thread-safe way)
or explicitly setting the property "org.apache.xerces.xni.parser.XMLParserConfiguration".
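A minimal sketch of the second option (setting the property), under two assumptions that are not confirmed in this thread: that Xerces consults this system property before it falls back to the jar service lookup (findJarServiceProvider) visible in the stack trace, and that org.apache.xerces.parsers.XIncludeAwareParserConfiguration is the appropriate configuration class for the Xerces version in use. The class and method names (XercesConfigProperty, pinParserConfiguration) are hypothetical.

```java
class XercesConfigProperty {
    // Property name as quoted in the message above.
    static final String KEY = "org.apache.xerces.xni.parser.XMLParserConfiguration";

    // ASSUMPTION: configuration class of Xerces-J 2.x; check it against
    // the Xerces version that is actually on the classpath.
    static final String ASSUMED_CONFIG =
            "org.apache.xerces.parsers.XIncludeAwareParserConfiguration";

    /**
     * Sets the system property unless it is already set (for example via -D
     * on the command line) and returns the effective value.
     */
    static String pinParserConfiguration() {
        if (System.getProperty(KEY) == null) {
            System.setProperty(KEY, ASSUMED_CONFIG);
        }
        return System.getProperty(KEY);
    }

    public static void main(String[] args) {
        // Call once at startup, before any parser is created.
        System.out.println(KEY + "=" + pinParserConfiguration());
    }
}
```

The same effect should be achievable without code changes by passing the property with -D on the JVM command line. Whether this removes the contention would need to be confirmed with a fresh thread dump.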
Has anyone seen a similar problem?
Thanks,
Sebastian
[1] https://issues.apache.org/jira/browse/NUTCH-2578
Re: Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
Based on the Xerces discussion it sounds like using a pool of parsers
would be the best approach.
Best,
Jukka
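The pool suggested above could look roughly like the following sketch. It is not Tika's implementation: the class SaxParserPool and its methods are hypothetical, it uses only the JAXP API from the JDK, and it assumes the underlying SAXParser implementation supports reset(). The point is that SAXParserFactory.newSAXParser(), the call at the top of the blocked stack, runs only while the pool is being filled.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxParserPool {
    private final BlockingQueue<SAXParser> pool;

    SaxParserPool(int size) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            // The expensive parser construction happens only here.
            pool.add(factory.newSAXParser());
        }
    }

    /** Borrows a parser, blocking until one is available. */
    SAXParser acquire() throws InterruptedException {
        return pool.take();
    }

    /** Resets the parser to its initial state and returns it to the pool. */
    void release(SAXParser parser) {
        // ASSUMPTION: the implementation supports reset(); the JAXP default
        // throws UnsupportedOperationException if it does not.
        parser.reset();
        pool.add(parser);
    }

    /** Example use: returns the local name of the document's root element. */
    String rootElement(byte[] xml) throws Exception {
        final String[] root = new String[1];
        SAXParser parser = acquire();
        try {
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) {
                        root[0] = localName; // remember only the first element
                    }
                }
            });
        } finally {
            release(parser); // return the parser even if parsing failed
        }
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        SaxParserPool parsers = new SaxParserPool(4);
        byte[] doc = "<feed xmlns='http://www.w3.org/2005/Atom'><entry/></feed>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(parsers.rootElement(doc));
    }
}
```

Each parser is used by one thread at a time, so the parsers themselves do not need to be thread-safe. The pool size is a tuning parameter; with fewer parsers than threads, callers wait in acquire() instead of on the jar file lock.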