Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/07/11 18:29:00 UTC

[jira] [Resolved] (TIKA-1568) AutoDetectReader performance problem

     [ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1568.
-------------------------------
       Resolution: Fixed
         Assignee: Tim Allison
    Fix Version/s: 1.22

We added {{AbstractEncodingDetectorParser}} a few versions ago, and I just added caching of the detectors in {{AutoDetectReader}} for default initialization, i.e. the {{AutoDetectReader(InputStream)}} and {{AutoDetectReader(InputStream, Metadata)}} constructors.
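
For readers who want a picture of what "caching the detectors for default initialization" can look like, here is a minimal sketch. It is not the committed Tika code: {{Detector}}, {{Reader}}, {{loadDetectors()}} and {{LOOKUPS}} are hypothetical stand-ins (for {{EncodingDetector}}, {{AutoDetectReader}} and the classpath scan), and the holder-class idiom is just one way to get a lazily initialized, thread-safe cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedDetectorsSketch {

    /** Hypothetical stand-in for Tika's EncodingDetector. */
    interface Detector {
        String detect(byte[] prefix);
    }

    /** Counts how often the expensive lookup runs; for illustration only. */
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    /** Hypothetical stand-in for the expensive service-provider lookup. */
    static List<Detector> loadDetectors() {
        LOOKUPS.incrementAndGet();
        List<Detector> found = new ArrayList<>();
        found.add(prefix -> "UTF-8");
        return Collections.unmodifiableList(found);
    }

    /** Holder idiom: the JVM initializes this class once, on first access. */
    private static final class DefaultDetectors {
        static final List<Detector> INSTANCE = loadDetectors();
    }

    /** Hypothetical reader; the default constructor reuses the cached list. */
    static final class Reader {
        final List<Detector> detectors;

        Reader() {
            this(DefaultDetectors.INSTANCE);
        }

        Reader(List<Detector> detectors) {
            this.detectors = detectors;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            new Reader();
        }
        System.out.println(LOOKUPS.get()); // prints 1
    }
}
```

Callers that pass their own detectors or loader still go through the second constructor, so only the default path shares the cached list.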

> AutoDetectReader performance problem
> ------------------------------------
>
>                 Key: TIKA-1568
>                 URL: https://issues.apache.org/jira/browse/TIKA-1568
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Andrzej Bialecki 
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.22
>
>
> Parsing performance for large numbers of text files suffers from repeated calls to ServiceLoader.loadServiceProviders(EncodingDetector.class). This happens in TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using the default ServiceLoader instance created in the Parser's static section, this cost can be avoided by caching the resulting List<EncodingDetector> at a higher level in the Parser (as a static property). If custom ServiceLoaders are in use, the same effect can be achieved either by putting this list in the ParseContext or by caching these lists at a lower level, in the ServiceLoader component.
> The relevant part of the stack trace follows:
> {code}
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
> 	- locked <0x00000007909d2e48> (a java.util.jar.JarFile)
> 	at java.util.jar.JarFile.getEntry(JarFile.java:227)
> 	at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
> 	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
> 	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
> 	at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
> 	at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
> 	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
> 	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
> 	at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
> 	at java.util.Collections.list(Collections.java:3687)
> 	at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
> 	at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
> 	at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
> 	at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
> 	at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
> 	at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
> 	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
> 	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> ...
> {code}
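
The quoted description also mentions caching at the loader level when custom ServiceLoaders are in use. A rough sketch of that idea follows, assuming one cached list per class loader. None of these names come from Tika: {{PerLoaderCacheSketch}}, {{Detector}} and {{detectorsFor}} are hypothetical, and the scanner function stands in for the classpath scan seen in the stack trace.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class PerLoaderCacheSketch {

    /** Hypothetical stand-in for a service interface such as EncodingDetector. */
    interface Detector {
        String name();
    }

    private final Map<ClassLoader, List<Detector>> cache = new ConcurrentHashMap<>();
    private final Function<ClassLoader, List<Detector>> scanner;

    PerLoaderCacheSketch(Function<ClassLoader, List<Detector>> scanner) {
        this.scanner = scanner;
    }

    /** Returns the cached list for this loader, scanning only on first use. */
    List<Detector> detectorsFor(ClassLoader loader) {
        return cache.computeIfAbsent(loader, scanner);
    }

    public static void main(String[] args) {
        AtomicInteger scans = new AtomicInteger();
        PerLoaderCacheSketch cache = new PerLoaderCacheSketch(loader -> {
            scans.incrementAndGet(); // stands in for the classpath scan
            Detector stub = () -> "stub";
            return Collections.singletonList(stub);
        });
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        for (int i = 0; i < 1000; i++) {
            cache.detectorsFor(loader);
        }
        System.out.println(scans.get()); // prints 1
    }
}
```

One trade-off to keep in mind: a map with strong keys keeps each class loader reachable, so a container that redeploys webapps would probably need to evict entries on undeploy.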


