Posted to dev@tika.apache.org by "Luís Filipe Nassif (Jira)" <ji...@apache.org> on 2021/02/14 14:42:00 UTC

[jira] [Commented] (TIKA-3300) Figure out if we can improve tesseract parallelization

    [ https://issues.apache.org/jira/browse/TIKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284419#comment-17284419 ] 

Luís Filipe Nassif commented on TIKA-3300:
------------------------------------------

I also set OMP_THREAD_LIMIT=1 because my app is already multithreaded (it OCRs many files simultaneously). That gave me about a 2x-2.5x overall speedup. But if the client app is single-threaded, I would keep the default value, so that tesseract uses multiple threads to OCR each submitted file. Maybe tika-server and tika-app should set this?
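
To make the setup above concrete, here is a rough sketch (in Python for brevity, not Tika code) of a multithreaded client that caps each tesseract process at one OpenMP thread. The helper names tesseract_env, ocr_file and ocr_many, and the worker count, are made up for illustration. The only things taken from the real tools are the OMP_THREAD_LIMIT environment variable and the "tesseract <image> stdout" command-line form.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_env(app_is_multithreaded, base_env=None):
    """Return a copy of the environment to pass to a tesseract child process.

    Hypothetical helper: the name and signature are illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    if app_is_multithreaded:
        # The app already OCRs many files at once, so limit each
        # tesseract process to a single OpenMP thread.
        env["OMP_THREAD_LIMIT"] = "1"
    # Otherwise the variable is left as the caller had it (normally unset),
    # so tesseract keeps its default threading for each file.
    return env


def ocr_file(image_path, env):
    """Run the tesseract CLI on one image and return the recognized text."""
    # "stdout" as the output base makes tesseract print the text to stdout.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_many(image_paths, workers=4):
    """OCR several files in parallel, one single-threaded tesseract each."""
    env = tesseract_env(app_is_multithreaded=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda path: ocr_file(path, env), image_paths))
```

A single-threaded client would instead call tesseract_env(app_is_multithreaded=False) and submit files one at a time, leaving the parallelism to tesseract itself.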

> Figure out if we can improve tesseract parallelization 
> -------------------------------------------------------
>
>                 Key: TIKA-3300
>                 URL: https://issues.apache.org/jira/browse/TIKA-3300
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> https://github.com/tesseract-ocr/tesseract/issues/2609
> https://twitter.com/jbaiter_/status/1360266497864704008?s=20
> Not sure if this affects us? h/t [~jbaiter]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)