Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/25 15:14:00 UTC

[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

    [ https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512428#comment-17512428 ] 

Tim Allison commented on TIKA-3571:
-----------------------------------

I'm starting to hack on this now.  Given that images can be fairly large and that there can be many pages/slides per document, it feels like we need to have the renderer write to local files. We'll need to do this anyway for OCR (at least with tesseract), etc., and wrappers around command-line renderers (e.g., mutool) will naturally write to files.
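
To make the file-based idea concrete, here is a rough sketch of what such an interface might look like. This is only an illustration, not the API that ended up in Tika: the names PageRenderer, BlankPageRenderer and the page-%04d.png naming scheme are all hypothetical, and the stand-in renderer just writes blank PNGs where a real one would call into a rendering library or shell out to a command-line tool.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

public class RenderSketch {

    /**
     * Hypothetical interface (an assumption, not Tika's API): render each
     * page/slide of the input to its own file and return the file paths,
     * so that no full set of images has to be held in memory.
     */
    interface PageRenderer {
        List<Path> render(Path input, Path outputDir) throws IOException;
    }

    /**
     * Stand-in implementation that ignores the input and writes blank PNGs.
     * A real renderer would rasterize the actual pages here.
     */
    static class BlankPageRenderer implements PageRenderer {
        private final int pageCount;

        BlankPageRenderer(int pageCount) {
            this.pageCount = pageCount;
        }

        @Override
        public List<Path> render(Path input, Path outputDir) throws IOException {
            Files.createDirectories(outputDir);
            List<Path> pages = new ArrayList<>();
            for (int i = 1; i <= pageCount; i++) {
                // One image in memory at a time; it is written out immediately.
                BufferedImage image = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
                Path page = outputDir.resolve(String.format("page-%04d.png", i));
                ImageIO.write(image, "png", page.toFile());
                pages.add(page);
            }
            return pages;
        }
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("render-sketch");
        List<Path> pages = new BlankPageRenderer(3).render(Path.of("example.pdf"), outputDir);
        // Downstream consumers (e.g. an OCR step) would pick up these files by path.
        for (Path page : pages) {
            System.out.println(page.getFileName() + " (" + Files.size(page) + " bytes)");
        }
    }
}
```

Returning paths rather than image objects keeps memory use bounded by a single page, and the same contract would fit both in-process renderers and wrappers around external tools that already write to disk.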

> Add an interface for rendering engines
> --------------------------------------
>
>                 Key: TIKA-3571
>                 URL: https://issues.apache.org/jira/browse/TIKA-3571
>             Project: Tika
>          Issue Type: Wish
>            Reporter: Tim Allison
>            Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and certainly it might be useful to have alternatives for rendering files (e.g. this [Alfresco study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]), including MS Office or at least PPTX...
> And there are cases where users don't want the rendered images, but they do want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open an issue for discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)