Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/03/15 04:35:38 UTC

[jira] [Commented] (TIKA-1149) Improve parser lookup performance

    [ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362181#comment-14362181 ] 

Tyler Palsulich commented on TIKA-1149:
---------------------------------------

It's been two years since the last update. I'm interested in a speedup, but I'm not sure this is the best spot to do it. Also, we're thinking of moving away from service loading come Tika 2.0. So, I'm not sure it's worth bringing in these breaking changes.

I'll close as Won't Fix unless someone objects over the next week or so.

> Improve parser lookup performance
> ---------------------------------
>
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: 0001-TIKA-1149-Improve-parser-lookup-performance.patch, CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid recomputing the parsers map over and over
> in CompositeParser.getParsers(...) when the context is empty, and to cache the returned value instead.
> This can be done safely even under the assumption that the media-registry and the list of component parsers change while Tika is executing, by invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of CompositeParser.
> The patch checks for the case where the context is empty, and invalidates the cache in the corresponding setters whenever the media-registry or the list of component parsers changes.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java class library + Tika app + other apps), the patch reduces the running time
> from 32 seconds to 29 seconds, i.e., a speedup of ~12%. Speedups of the same order of magnitude are also found for smaller workloads.
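
The caching idea described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: CachedLookupSketch, its fields, and the boolean contextIsEmpty parameter are hypothetical stand-ins for CompositeParser and ParseContext, and component parsers are modeled as plain strings. It is also not thread-safe as written.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a composite parser; NOT Tika's actual class.
class CachedLookupSketch {
    // Assumption: each component is modeled as "parser name -> supported media types".
    private Map<String, List<String>> components = new HashMap<>();
    // Cached "media type -> parser name" map; null means it must be recomputed.
    private Map<String, String> cache;
    // Counts rebuilds so the effect of the cache is observable.
    int rebuilds = 0;

    // The setter invalidates the cache, so a later lookup sees the new components.
    void setComponents(Map<String, List<String>> newComponents) {
        this.components = new HashMap<>(newComponents);
        this.cache = null;
    }

    // Returns the "media type -> parser name" map.
    Map<String, String> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            // A non-empty context may change the result, so bypass the cache.
            return build();
        }
        if (cache == null) {
            cache = build();
        }
        return cache;
    }

    // Recomputes the full map from the component list.
    private Map<String, String> build() {
        rebuilds++;
        Map<String, String> map = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : components.entrySet()) {
            for (String type : entry.getValue()) {
                map.put(type, entry.getKey());
            }
        }
        return map;
    }
}
```

With an empty context, repeated calls to getParsers reuse the same map until a setter runs; any call with a non-empty context still pays the full rebuild cost.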



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)