Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/10/04 12:47:00 UTC
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638183#comment-16638183 ] 

Tim Allison edited comment on TIKA-2749 at 10/4/18 12:46 PM:
-------------------------------------------------------------

There are two basic options (see our [wiki on OCR and PDFs|https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR]):

1) run OCR on each inline image
2) render the page and then run OCR on that single image

My strawman, heuristic, 100% hackery proposal is this:

0) trigger OCR if fewer than 10 words are extracted from a page
1) if the page contains <= 5 inline images, run OCR on each of the inline images (strategy 1)
2) if the page contains > 5 inline images, render the full page and run OCR on that (strategy 2)
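
To make the decision rule concrete, here is a rough sketch of the heuristic above. It assumes the per-page word count and inline image count have already been computed. The class, enum, and method names are hypothetical and are not part of Tika's API; only the thresholds (10 words, 5 images) come from the proposal.

```java
/** Hypothetical sketch of the strawman heuristic; none of these names are Tika APIs. */
final class OcrStrategyChooser {

    /** What to do with a single PDF page. */
    enum Strategy { NO_OCR, OCR_INLINE_IMAGES, OCR_RENDERED_PAGE }

    // Thresholds taken directly from the proposal.
    static final int MIN_WORDS_WITHOUT_OCR = 10;
    static final int MAX_INLINE_IMAGES = 5;

    private OcrStrategyChooser() {}

    /**
     * @param extractedWordCount words extracted from the page's text layer (assumed precomputed)
     * @param inlineImageCount   inline images found on the page (assumed precomputed)
     */
    static Strategy choose(int extractedWordCount, int inlineImageCount) {
        // Step 0: only trigger OCR if fewer than 10 words were extracted.
        if (extractedWordCount < MIN_WORDS_WITHOUT_OCR) {
            // Step 1: few inline images, so OCR each of them (strategy 1).
            // Note: with zero inline images this branch has nothing to OCR;
            // the proposal does not say how to handle that case.
            if (inlineImageCount <= MAX_INLINE_IMAGES) {
                return Strategy.OCR_INLINE_IMAGES;
            }
            // Step 2: many inline images, so render the page and OCR it (strategy 2).
            return Strategy.OCR_RENDERED_PAGE;
        }
        return Strategy.NO_OCR;
    }
}
```

For example, under these assumptions a page with 3 extracted words and 2 inline images would get strategy 1, while a page with 3 extracted words and 6 inline images would get strategy 2.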

[~lfcnassif], I _think_ (0) above derives from one of your recommendations?  Please chime in on this ticket. :D

This issue will take some time.  I don't plan to start on it any time soon.



> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on inline images within PDFs.  The user has to 1) understand that these are available and then 2) elect to turn one of those on.
> I think we should make OCR on PDFs "just work", perhaps with a hybrid strategy between the two options.  Users should still be allowed to configure as they wish, of course.


