You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@tika.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/24 14:59:00 UTC

[jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request

    [ https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489163#comment-16489163 ] 

ASF GitHub Bot commented on TIKA-2520:
--------------------------------------

mbaechler opened a new pull request #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237
 
 
   	When using tika-server, every single call triggers loadModel and
   	this method is very CPU intensive.
   
   	Optimaize uses immutable objects so we can easily reuse the
   	default model when no configuration is provided, it's a
   	20x improvement for my workload.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
> 	LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
> 	String detectedLang = language.getLanguage();
> 	LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
> 	return detectedLang;
> }
> {code}
> This could be optimized by (lazy?) loading the models only once and keep them in memory. I assume the `LanguageDetector` is not thread safe, so I expect this requires an ExecutorService with language detectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)