Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/09/04 08:11:51 UTC

[jira] [Commented] (PDFBOX-2313) ExtractImages finds never-rendered images

    [ https://issues.apache.org/jira/browse/PDFBOX-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120996#comment-14120996 ] 

Tilman Hausherr commented on PDFBOX-2313:
-----------------------------------------

To extract only the images which are actually drawn to the page at some point, we'd need to create something like a PageDrawer that doesn't draw. We would have to parse all content streams of a page, including substreams, and remember which images we saved. Nobody has ever complained about getting too many images (including me, although I noticed years ago already that I was getting more than I needed).
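
To make the "PageDrawer that doesn't draw" idea concrete, here is a rough sketch in plain Java. It does not use any PDFBox classes: SketchStream, SketchImage and DrawnImageCollector are hypothetical stand-ins invented for this illustration, and a real version would have to hook into the content stream parser instead. The point is only the traversal: follow every "Do" invocation, recurse into forms, and keep a set of the images already seen.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an image XObject (not a PDFBox class).
final class SketchImage {
    final String id;

    SketchImage(String id) {
        this.id = id;
    }
}

// Hypothetical stand-in for a page or form content stream (not a PDFBox class).
final class SketchStream {
    final List<String> doOperands;       // names invoked with the "Do" operator, in order
    final Map<String, Object> xobjects;  // resource name -> SketchImage or SketchStream (form)

    SketchStream(List<String> doOperands, Map<String, Object> xobjects) {
        this.doOperands = doOperands;
        this.xobjects = xobjects;
    }
}

// Walks the streams without rendering anything and records each invoked image once.
final class DrawnImageCollector {
    private final Set<SketchImage> saved = new LinkedHashSet<SketchImage>();
    private final Set<SketchStream> active = new HashSet<SketchStream>();

    Set<SketchImage> collect(SketchStream page) {
        walk(page);
        return saved;
    }

    private void walk(SketchStream stream) {
        if (!active.add(stream)) {
            return; // the stream is already being processed: guard against self-referencing forms
        }
        for (String name : stream.doOperands) {
            Object xobject = stream.xobjects.get(name);
            if (xobject instanceof SketchImage) {
                saved.add((SketchImage) xobject); // a repeated image is remembered only once
            } else if (xobject instanceof SketchStream) {
                walk((SketchStream) xobject);     // substream: recurse into the form
            }
        }
        active.remove(stream);
    }
}
```

An image that sits in the resources but is never invoked from any stream simply never reaches the saved set, which is the behaviour the issue asks for.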

It would be easier to just put back the PDImageXObject.clear() method removed in PDFBOX-2310. Alternatively, we could get rid of the image caching, or improve it, e.g. by caching only small images, or by caching only the most recently used images.
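
As an illustration of the last two options combined, here is a minimal sketch of a cache that skips large images and keeps only the most recently used ones. It is not PDFBox code: the class name SmallImageLruCache, the byte[] payload and both limits are assumptions made up for this example. It relies only on the access-order mode of java.util.LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache sketch: bounded entry count, and large images are never cached.
final class SmallImageLruCache<K> {
    private final int maxEntries;
    private final long maxBytesPerImage;
    private final LinkedHashMap<K, byte[]> entries;

    SmallImageLruCache(int maxEntries, long maxBytesPerImage) {
        this.maxEntries = maxEntries;
        this.maxBytesPerImage = maxBytesPerImage;
        // accessOrder = true: the eldest entry is the least recently used one
        this.entries = new LinkedHashMap<K, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, byte[]> eldest) {
                return size() > SmallImageLruCache.this.maxEntries;
            }
        };
    }

    // Caches the decoded data only if it is small enough; larger images are decoded again on demand.
    void put(K key, byte[] decoded) {
        if (decoded.length <= maxBytesPerImage) {
            entries.put(key, decoded);
        }
    }

    // Returns the cached data or null; a hit marks the entry as most recently used.
    byte[] get(K key) {
        return entries.get(key);
    }

    // Does not change the access order.
    boolean contains(K key) {
        return entries.containsKey(key);
    }

    int size() {
        return entries.size();
    }
}
```

Limiting by entry count is the simplest bound; limiting by total bytes would match the memory problem more closely, but needs extra bookkeeping on insert and eviction.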

> ExtractImages finds never-rendered images
> -----------------------------------------
>
>                 Key: PDFBOX-2313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>
> The file from PDFBOX-2101 is still causing unexpectedly high memory use with ExtractImages when compared to PDFToImage. Given that PDFToImage uses the same caching strategy, it's not really a caching issue, though we might still want to think about that.
> The PDF contains 55 images on the first page which are never rendered and ExtractImages runs out of memory trying to extract them all. Given that PDFs often contain junk like this, I suggest that ExtractImages only extract images which are actually drawn to the page at some point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)