
Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/24 16:49:00 UTC

[jira] [Commented] (TIKA-3416) Extract logical images from PDFs

    [ https://issues.apache.org/jira/browse/TIKA-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350521#comment-17350521 ] 

Tim Allison commented on TIKA-3416:
-----------------------------------

This overlaps somewhat with TIKA-3348 but is a distinct issue.

> Extract logical images from PDFs
> --------------------------------
>
>                 Key: TIKA-3416
>                 URL: https://issues.apache.org/jira/browse/TIKA-3416
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> PDFs, bless their hearts, can store a logical image as hundreds or thousands of subimages that, when rendered, look like one image.
> We currently have the option to let the user render the page and run OCR on that rendered image, or the user can extract inline images.  There has to be a happier medium, and the user should get back the rendering in, e.g., the /unpack endpoint (see TIKA-3348).
> It would be handy for some use cases to do the geometry to find bounding boxes for image components and then render those bounding boxes so that a human gets a "logical image" <hand_waving>most of the time</hand_waving>.
> There would have to be some heuristics for when to give up and just render the whole page, but I think we could do something that performed well enough.  More importantly, I'm sure this is a solved problem...any recs for efficient algorithms for this?
> What do you think?
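
One way to picture the bounding-box idea in the description is the rough sketch below. It is not Tika or PDFBox code and not a proposed implementation: it assumes the subimage rectangles have already been collected in page coordinates, and every name in it (`near`, `union`, `merge_boxes`, `logical_image_regions`, `gap`, `max_regions`) is hypothetical. It merges rectangles that overlap or nearly touch until nothing changes, and falls back to the whole page when too many separate regions remain, as one possible "give up" heuristic.

```python
from typing import List, Tuple

# A box is (x0, y0, x1, y1) in page coordinates, with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def near(a: Box, b: Box, gap: float) -> bool:
    """True if the boxes overlap or lie within `gap` units of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box], gap: float = 1.0) -> List[Box]:
    """Merge boxes that touch or nearly touch, repeating until stable."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, other in enumerate(result):
                if near(box, other, gap):
                    # Grow the existing region; a later pass picks up any
                    # regions that the grown box now reaches.
                    result[i] = union(box, other)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


def logical_image_regions(boxes: List[Box], page: Box,
                          gap: float = 1.0,
                          max_regions: int = 20) -> List[Box]:
    """Return regions to render, or the whole page if merging did not help.

    The max_regions cutoff is an assumed heuristic for illustration only.
    """
    if not boxes:
        return []
    regions = merge_boxes(boxes, gap)
    if len(regions) > max_regions:
        return [page]
    return regions
```

Each pass compares every box against the regions built so far, so the cost is quadratic per pass. For pages with thousands of subimages, a spatial index or a sweep over sorted coordinates would likely be a better fit.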



--
This message was sent by Atlassian Jira
(v8.3.4#803005)