Posted to dev@tika.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2015/03/09 14:47:38 UTC

[jira] [Created] (TIKA-1568) TXTParser performance problem

Andrzej Bialecki created TIKA-1568:
---------------------------------------

             Summary: TXTParser performance problem
                 Key: TIKA-1568
                 URL: https://issues.apache.org/jira/browse/TIKA-1568
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.7
            Reporter: Andrzej Bialecki 


Parsing many plain text files is slow because every TXTParser.parse() call constructs an AutoDetectReader, which in turn calls ServiceLoader.loadServiceProviders(EncodingDetector.class) and rescans the classpath for service resources. In the common case, where Tika uses the default ServiceLoader instance created in TXTParser, this cost can be avoided by caching the resulting List<EncodingDetector>, either at a higher level in TXTParser (e.g. by putting it in ParseContext) or at a lower level in ServiceLoader.
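
To illustrate the lower-level option, here is a minimal sketch of memoizing the provider list per service interface. It is not Tika code: CachedProviderLookup, providers() and the expensiveLookup function are hypothetical names, and the lookup function only stands in for the classpath scan that ServiceLoader performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, NOT part of Tika: memoizes provider lists per service interface.
final class CachedProviderLookup {

    // Stand-in for the expensive classpath scan (an assumption made for this sketch).
    private final Function<Class<?>, List<?>> expensiveLookup;

    // One immutable provider list per service interface.
    private final ConcurrentMap<Class<?>, List<?>> cache = new ConcurrentHashMap<>();

    CachedProviderLookup(Function<Class<?>, List<?>> expensiveLookup) {
        this.expensiveLookup = expensiveLookup;
    }

    // Returns the cached list, running the expensive lookup at most once per interface.
    @SuppressWarnings("unchecked")
    <T> List<T> providers(Class<T> iface) {
        List<?> cached = cache.computeIfAbsent(iface, k -> {
            // Defensive copy so later changes by the caller cannot alter the cached list.
            List<Object> copy = new ArrayList<Object>(expensiveLookup.apply(k));
            return Collections.unmodifiableList(copy);
        });
        return (List<T>) cached;
    }

    public static void main(String[] args) {
        AtomicInteger scans = new AtomicInteger();
        CachedProviderLookup lookup = new CachedProviderLookup(k -> {
            scans.incrementAndGet(); // count how often the "scan" really runs
            return Arrays.asList("detectorA", "detectorB");
        });
        // Simulate many parse() calls asking for the same provider list.
        for (int i = 0; i < 1000; i++) {
            lookup.providers(CharSequence.class);
        }
        System.out.println("scans performed: " + scans.get()); // prints "scans performed: 1"
    }
}
```

A cache like this is only safe while the set of providers visible to the class loader does not change. Any invalidation for dynamically registered providers is outside this sketch.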

The relevant part of the stack trace follows:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
	- locked <0x00000007909d2e48> (a java.util.jar.JarFile)
	at java.util.jar.JarFile.getEntry(JarFile.java:227)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
	at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
	at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
	at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
	at java.util.Collections.list(Collections.java:3687)
	at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
	at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
	at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
	at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
	at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
	at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
{code}


