Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/24 16:46:00 UTC
[jira] [Created] (TIKA-3416) Extract logical images from PDFs
Tim Allison created TIKA-3416:
---------------------------------
Summary: Extract logical images from PDFs
Key: TIKA-3416
URL: https://issues.apache.org/jira/browse/TIKA-3416
Project: Tika
Issue Type: New Feature
Reporter: Tim Allison
PDFs, bless their hearts, can store a logical image as hundreds or thousands of subimages that, when rendered, look like one image.
It would be handy for some use cases to do the geometry to find bounding boxes for image components and then render those bounding boxes so that a human gets a "logical image" <hand_waving>most of the time</hand_waving>.
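A rough sketch of the geometry step, assuming each subimage's bounding box is already available in page coordinates as (x0, y0, x1, y1). The names boxes_touch and merge_boxes and the gap tolerance are hypothetical, not part of Tika or PDFBox. It groups boxes that overlap or nearly touch with a union-find and returns one bounding box per group:

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def boxes_touch(a: Box, b: Box, gap: float) -> bool:
    """True if the two boxes overlap or lie within `gap` units of each other."""
    return (a[0] <= b[2] + gap and b[0] <= a[2] + gap
            and a[1] <= b[3] + gap and b[1] <= a[3] + gap)


def merge_boxes(boxes: List[Box], gap: float = 1.0) -> List[Box]:
    """Cluster touching boxes and return one bounding box per cluster."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); a sweep line or spatial index
    # would likely be needed for pages with many thousands of tiles.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_touch(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)

    # Grow one bounding box per cluster root.
    merged: Dict[int, Box] = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in merged:
            mx0, my0, mx1, my1 = merged[root]
            merged[root] = (min(mx0, x0), min(my0, y0),
                            max(mx1, x1), max(my1, y1))
        else:
            merged[root] = (x0, y0, x1, y1)
    return sorted(merged.values())


# Three adjacent tiles plus one isolated image elsewhere on the page.
tiles = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20),
         (100, 100, 110, 110)]
regions = merge_boxes(tiles)
```

Each returned region would then be a candidate "logical image" to render. Note that the bounding box of one cluster can still overlap another cluster's box; this sketch does not re-merge in that case.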
There would have to be some heuristics for when to give up and just render the whole page, but I think we could do something that performed well enough. More importantly, I'm sure this is a solved problem...any recs for efficient algorithms for this?
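One possible shape for the "give up" heuristic, purely as an illustration: fall back to rendering the whole page when there are too many candidate regions or when they cover most of the page anyway. The function name and both thresholds are made-up assumptions, not tuned values, and regions are again assumed to be (x0, y0, x1, y1) boxes in page coordinates:

```python
from typing import List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def should_render_whole_page(regions: List[Box],
                             page_width: float,
                             page_height: float,
                             max_regions: int = 20,
                             max_coverage: float = 0.8) -> bool:
    """Decide whether to skip per-region rendering and render the full page.

    Both thresholds are placeholder guesses for illustration only.
    """
    if not regions:
        return False  # nothing image-like found, nothing to fall back from
    if len(regions) > max_regions:
        return True  # too fragmented to be a handful of logical images
    page_area = page_width * page_height
    # Summing areas overestimates coverage when regions overlap,
    # which only makes the fallback slightly more eager.
    covered = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in regions)
    return covered / page_area > max_coverage


# US Letter page (612 x 792 points) with one nearly full-page region.
fallback = should_render_whole_page([(0, 0, 600, 780)], 612, 792)
```

Real thresholds would need to be worked out against a corpus of actual PDFs.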
What do you think?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)