Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/20 17:52:00 UTC

[jira] [Resolved] (TIKA-3270) Render non-text in PDFs for OCR

     [ https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-3270.
-------------------------------
    Fix Version/s: 2.0.0
         Assignee: Tim Allison
       Resolution: Fixed

This is now in 2.0.0.  I realize that we might want to add a render-text-only strategy for cases where there's electronic text but the Unicode mappings are broken.  This may need further tweaks, but rendering without text was easy.  Thank you [~tilman] and [~lfcnassif]!

> Render non-text in PDFs for OCR
> -------------------------------
>
>                 Key: TIKA-3270
>                 URL: https://issues.apache.org/jira/browse/TIKA-3270
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: test-no-text.png, test.png, tiger-no-text.png, tiger.pdf
>
>
> When we render a PDF page for OCR, we are relying on PDFBox to render all of the contents of the page, including text that may be available via regular extraction methods.
> The result of this is that if a user selects ocr_and_text, there can be duplicate text -- text as stored in PDFs and the text generated via OCR.  In the xhtml output, we do mark a separate "div" for OCR so that users can distinguish, but still, it might be useful not to have to run OCR on text that was reliably extracted.
> One solution to this was proposed by [~lfcnassif] on TIKA-3258, with a technical/implementation recommendation by [~tilman] to subclass PDFRenderer and PageDrawer to render only the image components of a page.
> This would be a new, non-breaking feature.  This is not a blocker on 2.0.0.
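
To illustrate the subclassing idea in the quoted description, here is a minimal, self-contained sketch. It does not use the PDFBox API: Renderer, Drawer, Op, and every method name below are hypothetical stand-ins for PDFRenderer, PageDrawer, and their hooks, and the "canvas" is just a list of strings. It only shows the shape of the approach, assuming the renderer creates its drawer through an overridable factory method and the drawer has a separate hook for text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a page renderer; NOT the PDFBox API. */
class PageRendererSketch {

    /** Kinds of content a page's drawing operations can produce. */
    enum Kind { TEXT, IMAGE, PATH }

    /** One drawing operation on a page (hypothetical). */
    static final class Op {
        final Kind kind;
        final String label;

        Op(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }
    }

    /** Default drawer: paints everything, with one overridable hook per kind. */
    static class Drawer {
        protected final List<String> canvas = new ArrayList<>();

        protected void showText(Op op)   { canvas.add("text:" + op.label); }
        protected void drawImage(Op op)  { canvas.add("image:" + op.label); }
        protected void strokePath(Op op) { canvas.add("path:" + op.label); }

        /** Dispatches each operation to its hook and returns the canvas. */
        final List<String> draw(List<Op> page) {
            for (Op op : page) {
                switch (op.kind) {
                    case TEXT:  showText(op);   break;
                    case IMAGE: drawImage(op);  break;
                    case PATH:  strokePath(op); break;
                }
            }
            return canvas;
        }
    }

    /** Drawer that skips text, so only non-text content reaches the canvas. */
    static class NoTextDrawer extends Drawer {
        @Override
        protected void showText(Op op) {
            // Intentionally empty: text is already available via extraction.
        }
    }

    /** Default renderer: creates its drawer through a factory method. */
    static class Renderer {
        protected Drawer createDrawer() { return new Drawer(); }

        List<String> render(List<Op> page) { return createDrawer().draw(page); }
    }

    /** Renderer that swaps in the text-skipping drawer. */
    static class NoTextRenderer extends Renderer {
        @Override
        protected Drawer createDrawer() { return new NoTextDrawer(); }
    }

    /** A made-up page with one operation of each kind. */
    static List<Op> samplePage() {
        List<Op> page = new ArrayList<>();
        page.add(new Op(Kind.TEXT, "Hello"));
        page.add(new Op(Kind.IMAGE, "tiger"));
        page.add(new Op(Kind.PATH, "border"));
        return page;
    }

    public static void main(String[] args) {
        List<Op> page = samplePage();
        System.out.println("all:     " + new Renderer().render(page));
        System.out.println("no text: " + new NoTextRenderer().render(page));
    }
}
```

In this sketch the only behavioral change lives in one overridden hook, so the default rendering path is untouched, which fits the description of the feature as non-breaking. A text-only variant would, under the same assumptions, override the image and path hooks instead.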


