Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/20 18:24:00 UTC

[jira] [Reopened] (TIKA-3270) Render non-text in PDFs for OCR

     [ https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reopened TIKA-3270:
-------------------------------

Have to rework the logic a bit.  The default rendering strategy is "render with no text and then run OCR".  However, we should make the default a bit smarter... an AUTO rendering mode.

If you're in AUTO (OCR mode) and OCR is triggered because of missing Unicode code points, then you'd want to run OCR on everything. If it is triggered because of too few characters, then you'd still want to run OCR on everything.

If you're in OCR_ONLY mode, you'd want to run OCR on everything (or maybe _only_ the text?)

If you're in TEXT_AND_OCR mode, you'd want OCR on the non-text bits only.
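
A rough sketch of how that AUTO rendering decision might look, just to pin down the cases above. Everything here is hypothetical: the class, the enum names (OcrMode, AutoTrigger, RenderTarget) and the plan() method are made up for illustration and are not Tika's actual API. It also assumes that AUTO renders nothing when OCR was not triggered, and it takes "everything" for OCR_ONLY even though that one is still an open question.

```java
// Hypothetical sketch only; none of these names come from Tika itself.
class OcrRenderPlanner {

    // The OCR modes discussed above.
    enum OcrMode { AUTO, OCR_ONLY, TEXT_AND_OCR }

    // Why OCR was triggered while in AUTO mode (NONE = not triggered).
    enum AutoTrigger { NONE, MISSING_UNICODE, TOO_FEW_CHARS }

    // What should be rendered to the image that is handed to OCR.
    enum RenderTarget { NOTHING, NON_TEXT_ONLY, EVERYTHING }

    static RenderTarget plan(OcrMode mode, AutoTrigger trigger) {
        switch (mode) {
            case AUTO:
                // Both triggers mean the extracted text is unreliable,
                // so OCR should see the whole page.
                // Assumption: no trigger means no rendering for OCR at all.
                return trigger == AutoTrigger.NONE
                        ? RenderTarget.NOTHING
                        : RenderTarget.EVERYTHING;
            case OCR_ONLY:
                // Open question in the comment above: "everything" is assumed here.
                return RenderTarget.EVERYTHING;
            case TEXT_AND_OCR:
                // Text is extracted normally, so OCR only needs the non-text bits.
                return RenderTarget.NON_TEXT_ONLY;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // Print the full decision table.
        for (OcrMode mode : OcrMode.values()) {
            for (AutoTrigger trigger : AutoTrigger.values()) {
                System.out.println(mode + " / " + trigger + " -> " + plan(mode, trigger));
            }
        }
    }
}
```

The trigger only matters in AUTO mode; the other two modes ignore it.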



> Render non-text in PDFs for OCR
> -------------------------------
>
>                 Key: TIKA-3270
>                 URL: https://issues.apache.org/jira/browse/TIKA-3270
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: test-no-text.png, test.png, tiger-no-text.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of the contents of the page, including text that may be available via regular extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be duplicate text -- text as stored in PDFs and the text generated via OCR.  In the xhtml output, we do mark a separate "div" for OCR so that users can distinguish, but still, it might be useful not to have to run OCR on text that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a technical/implementation recommendation by [~tilman] to subclass PDFRenderer and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)