Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2014/07/23 18:26:27 UTC

RE: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Andreas and Tilman,

  Thank you very much for fixing this so quickly.  I'm finally getting around to figuring out whether we should change anything in the Tika code based on your fixes.  If I follow the example of the latest ExtractImages for the 1.8.x branch, I think we should add:

1) resources.clear() at the end of processResources()
2) image.clear() after image.write2file()

Is there anything else that our client code should do to decrease the memory footprint during extraction of images?  Thank you, again!
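
For illustration, the clear-after-use pattern in the two points above can be sketched roughly as follows. This is a minimal sketch and not PDFBox code: CachingImage, newImages(), retainedBytes() and this version of processResources() are hypothetical stand-ins, and the only assumption is that an image object keeps its decoded bytes cached until clear() is called.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in sketch of "clear after use"; none of these types are PDFBox classes. */
public class ClearAfterUseSketch {

    /** Hypothetical image object that caches its decoded bytes until cleared. */
    public static final class CachingImage {
        private final int decodedSize;
        private byte[] cache; // decoded data, retained until clear() is called

        public CachingImage(int decodedSize) {
            this.decodedSize = decodedSize;
        }

        /** Decodes on first use and keeps the result; a real writer would emit the bytes here. */
        public void write() {
            if (cache == null) {
                cache = new byte[decodedSize];
            }
        }

        /** Drops the cached bytes so they become eligible for garbage collection. */
        public void clear() {
            cache = null;
        }

        public int retainedBytes() {
            return cache == null ? 0 : cache.length;
        }
    }

    /** Builds a list of stand-in images that each decode to decodedSize bytes. */
    public static List<CachingImage> newImages(int count, int decodedSize) {
        List<CachingImage> images = new ArrayList<CachingImage>();
        for (int i = 0; i < count; i++) {
            images.add(new CachingImage(decodedSize));
        }
        return images;
    }

    /** Sums the bytes currently held in the caches of all images. */
    public static long retainedBytes(List<CachingImage> images) {
        long total = 0;
        for (CachingImage image : images) {
            total += image.retainedBytes();
        }
        return total;
    }

    /** Writes every image and returns the peak number of cached bytes seen. */
    public static long processResources(List<CachingImage> images, boolean clearAfterUse) {
        long peak = 0;
        for (CachingImage image : images) {
            image.write();
            peak = Math.max(peak, retainedBytes(images));
            if (clearAfterUse) {
                image.clear(); // release the cache as soon as the image is written
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("peak without clear(): " + processResources(newImages(4, 1000), false));
        System.out.println("peak with clear():    " + processResources(newImages(4, 1000), true));
    }
}
```

With four 1000-byte stand-in images, the run that never clears peaks at 4000 cached bytes, while the run that clears after each write peaks at 1000. Real savings depend on what the library actually caches.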

     Best,

              Tim

________________________________________
From: Andreas Lehmkühler (JIRA) [jira@apache.org]
Sent: Sunday, June 15, 2014 7:36 AM
To: dev@pdfbox.apache.org
Subject: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

     [ https://issues.apache.org/jira/browse/PDFBOX-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-2101.
----------------------------------------

    Resolution: Fixed

> Surprising memory consumption when extracting images
> ----------------------------------------------------
>
>                 Key: PDFBOX-2101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.8.5
>         Environment: Windows 7
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 1.8.6, 2.0.0
>
>         Attachments: 239665.pdf, PDFBOX-2101-298-good.jpg, PDFBOX-2101-714-poor.jpg, java.hprof.zip
>
>
> ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.
> On this 4MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], if extracting every image on every page (and there are many, many duplicate images), there is an OOM with -Xmx1g.  If there is no -Xmx and there is > 2.5g available, ExtractImages will work.
> With some experimentation, the triggers seem to be JPEG images that have masks.  I'm not sure, though, whether the issue is with PDFBox or Java.
> Command lines:
> 1.8.5:
> java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
> 2.0_SNAPSHOT:
> java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
> Results:
> 1.8.5: 906 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
>         at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
>         at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
>         at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
>         at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
> {noformat}
> 2.0_SNAPSHOT: 428 files before OOM
> {noformat}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>         at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
>         at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
>         at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
>         at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
>         at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
>         at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
>         at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
>         at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

RE: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Got it.  Will do.  Thank you.

________________________________________
From: Tilman Hausherr [THausherr@t-online.de]
Sent: Wednesday, July 23, 2014 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Hi Tim,
if you're working with pages (PDPage), you can also call .clear() after
you're done.
Tilman


Re: [jira] [Resolved] (PDFBOX-2101) Surprising memory consumption when extracting images

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi Tim,
if you're working with pages (PDPage), you can also call .clear() after 
you're done.
Tilman
