You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/06/05 19:14:02 UTC
[jira] [Comment Edited] (PDFBOX-2101) Surprising memory consumption when extracting images

    [ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018977#comment-14018977 ] 

Andreas Lehmkühler edited comment on PDFBOX-2101 at 6/5/14 5:12 PM:
--------------------------------------------------------------------

All resources of a page are now automatically cleared after the conversion to an image per default. The user may decide to disable that  feature when creating the PDFRenderer. I've added those changes in revision http://svn.apache.org/r1600699 to the trunk. In the 1.8 branch is any PDFRenderer so that I've added a clear method to the PDPage class only in revsion http://svn.apache.org/r1600701

[~tilman] IMHO it's more convenient to clear the resources automatically as the user may not be aware about those cached values.


was (Author: lehmi):
All resources of a page are now automatically cleared after the conversion to an image per default. The user may decide to disable that  feature when creating the PDFRenderer. I've added those changes in revision http://svn.apache.org/r1600699 to the trunk. In the 1.8 branch is any PDFRenderer so that I've added a clear method to the PDPage class only in revsion http://svn.apache.org/r1600701

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.  
> On this file 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g.  If there is no Xmx and there is > 2.5g available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
> Commandlines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.ja
> va:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:
> 514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDP
> ixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStr
> eam(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(
> PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:2
> 02)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.ja
> va:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(
> SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBIma
> ge(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDIm
> ageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:2
> 31)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.
> java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.jav
> a:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)