Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2013/07/22 14:58:48 UTC

[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

    [ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715180#comment-13715180 ] 

Jukka Zitting commented on TIKA-1149:
-------------------------------------

Note that, for example, {{DefaultParser.getParsers(ParseContext)}} can return a different set of parsers on each invocation because of the dynamic service lookup mechanism in {{ServiceLoader}}. Caching the return value can therefore lead to incorrect behavior.

An alternative optimization would be to refactor the {{CompositeParser.getParser(Metadata, ParseContext)}} method so that it does not always need to instantiate the full type->parser map. Instead, it could, for example, restrict the search to only the specified type and its supertypes.
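
To make the idea concrete, here is a rough, self-contained sketch of such a restricted lookup. It is not Tika code: {{SupertypeLookup}}, {{SimpleParser}}, {{addSupertype}}, {{typeChain}} and {{findParser}} are hypothetical stand-ins for the parser list and the media type registry, and types are plain strings. The sketch also assumes that later parsers should win over earlier ones, as they would if a full map were filled by successive puts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Tika classes.
final class SupertypeLookup {

    /** Minimal stand-in for a component parser: a name plus the types it supports. */
    static final class SimpleParser {
        final String name;
        final List<String> supportedTypes;

        SimpleParser(String name, List<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }

        boolean supports(String type) {
            return supportedTypes.contains(type);
        }
    }

    /** Maps a type to its direct supertype (stands in for a media type registry). */
    private final Map<String, String> supertypeOf = new HashMap<>();

    void addSupertype(String type, String supertype) {
        supertypeOf.put(type, supertype);
    }

    /** Returns the type followed by its supertypes, most specific first. */
    List<String> typeChain(String type) {
        List<String> chain = new ArrayList<>();
        // The contains() check guards against accidental cycles in the registry.
        for (String t = type; t != null && !chain.contains(t); t = supertypeOf.get(t)) {
            chain.add(t);
        }
        return chain;
    }

    /**
     * Looks up a parser without building a full type->parser map.
     * Only the requested type and its supertypes are checked, and the
     * parser list is consulted on every call, so a list that changes
     * between invocations is still honored.
     */
    SimpleParser findParser(List<SimpleParser> parsers, String type, SimpleParser fallback) {
        for (String candidate : typeChain(type)) {
            // Assumption of this sketch: later parsers take precedence.
            for (int i = parsers.size() - 1; i >= 0; i--) {
                if (parsers.get(i).supports(candidate)) {
                    return parsers.get(i);
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        SupertypeLookup lookup = new SupertypeLookup();
        lookup.addSupertype("application/xhtml+xml", "application/xml");
        lookup.addSupertype("application/xml", "text/plain");

        List<SimpleParser> parsers = List.of(
                new SimpleParser("txt", List.of("text/plain")),
                new SimpleParser("xml", List.of("application/xml")));
        SimpleParser fallback = new SimpleParser("fallback", List.of());

        // No parser claims XHTML directly, so the lookup falls through to its supertype.
        System.out.println(lookup.findParser(parsers, "application/xhtml+xml", fallback).name);
    }
}
```

Nothing is cached between calls here, so the result always reflects the parser list passed in. Whether this is actually faster than building the map would have to be measured, since each call now scans the parsers once per type in the chain.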
                
> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid recomputing the parsers map over and over 
> in CompositeParser.getParsers(...) if the context is empty, and to cache the returned value instead. 
> This can be done safely even under the assumption that the media registry and the list of component parsers change while Tika is executing, by invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of CompositeParser.
> The patch checks for the case where the context is empty and invalidates the cache whenever the media registry or the list of component parsers is changed through the corresponding setter.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java class library + Tika app + other apps), the patch reduces the running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same order of magnitude are also found for smaller workloads.
