Posted to dev@tika.apache.org by "Markus Mandalka (JIRA)" <ji...@apache.org> on 2019/01/02 13:12:00 UTC

[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732035#comment-16732035 ] 

Markus Mandalka commented on TIKA-2749:
---------------------------------------

Some ideas, experience and wishes from my side, from the development of Open Semantic Search. I would like it if

 
 * OCR could be deactivated at the document level via an HTTP/REST option. (At the moment I disable it by setting /bin/false as the tesseract binary, thanks to a tip from Tim Allison.)
 * in that case, Tika added a state/boolean/info field saying whether the document is OCRable, i.e. whether it contains images that would have been OCR'd but were not because OCR was disabled as in the first point. (Alternatively I could infer this from metadata fields; maybe such information is available even when /bin/false is used, but I have not yet had time to look deeper.)

 

The reason is that for Open Semantic ETL I plan to

- first extract and index documents without OCR, without having to change the global Tika config

and would like to be able to

- re-extract and re-index documents with OCR later. For performance, i.e. to avoid a second full extraction of documents where OCR would make no difference, this second pass could be limited to the documents that the earlier OCR-free extraction flagged as containing something to OCR.

 

OCR often needs a lot of processing time for often "only" a few additional pieces of information, so it could run afterwards and only for documents that include images. Users could then find most information much earlier: they would work with a relatively good index soon and a better, OCR'd index later, instead of waiting days or weeks for the first indexing of large document sets.
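
As a rough illustration of the two-pass idea described above, a minimal Python sketch could look like the following. It only shows the filtering step between the two passes. The record shape and the `ocr_candidate` flag are hypothetical names made up for this example; they stand in for whatever "is OCRable" information Tika might expose and are not an existing Tika field or API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FirstPassRecord:
    """Result of the fast, OCR-free extraction of one document (hypothetical shape)."""
    doc_id: str
    text: str
    # Hypothetical flag: True if the first pass saw images that OCR could process.
    ocr_candidate: bool


def select_for_ocr_pass(records: Iterable[FirstPassRecord]) -> List[str]:
    """Return the ids of documents worth re-extracting with OCR enabled."""
    return [r.doc_id for r in records if r.ocr_candidate]


# First pass: everything is indexed quickly without OCR.
first_pass = [
    FirstPassRecord("report.pdf", "plain text layer", ocr_candidate=False),
    FirstPassRecord("scan.pdf", "", ocr_candidate=True),
    FirstPassRecord("slides.pdf", "titles only", ocr_candidate=True),
]

# Second pass: only the flagged documents are queued for the slow OCR run.
ocr_queue = select_for_ocr_pass(first_pass)
print(ocr_queue)
```

With such a flag stored in the index, the expensive OCR run would touch only the flagged documents, while the others keep the result of the first extraction.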

 

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on inline images within PDFs.  The user has to 1) understand that these are available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid strategy between the 2 options.  Users should still be allowed to configure as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)