Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/05/28 18:08:03 UTC
[jira] [Created] (PDFBOX-2101) Surprising memory consumption when extracting images
Tim Allison created PDFBOX-2101:
-----------------------------------
Summary: Surprising memory consumption when extracting images
Key: PDFBOX-2101
URL: https://issues.apache.org/jira/browse/PDFBOX-2101
Project: PDFBox
Issue Type: Bug
Components: Utilities
Affects Versions: 1.8.5
Environment: Windows 7
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Reporter: Tim Allison
Priority: Minor
ExtractImages seems to fail to release memory resources on some files in both PDFBox 1.8.5 and trunk.
On this 4 MB file [http://digitalcorpora.org/corp/nps/files/govdocs1/239/239665.pdf], extracting every image on every page (there are many, many duplicate images) ends in an OutOfMemoryError with -Xmx1g. With no -Xmx setting and more than 2.5 GB of memory available, ExtractImages completes.
With some experimentation, the triggers seem to be JPEG images that have masks. I'm not sure, though, whether the issue is with PDFBox or Java.
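One way to check whether memory is really being retained from one image to the next, rather than only spiking briefly, is to log heap usage after a requested GC at intervals during extraction. A minimal JDK-only sketch follows. The HeapProbe class and its method are hypothetical names, not part of PDFBox, and the 8 MiB arrays merely stand in for whatever data might be kept alive per image:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: reports heap in use after requesting a GC. */
public class HeapProbe {

    /** Approximate bytes of heap in use, measured after asking the JVM to collect garbage. */
    public static long usedAfterGc() {
        Runtime rt = Runtime.getRuntime();
        // System.gc() is only a request, so the figure is an approximation.
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<byte[]>();
        long baseline = usedAfterGc();
        for (int i = 1; i <= 4; i++) {
            // Stand-in for data that stays reachable after an image is processed.
            retained.add(new byte[8 * 1024 * 1024]);
            long deltaMiB = (usedAfterGc() - baseline) / (1024 * 1024);
            System.out.println("after step " + i + ": about +" + deltaMiB
                    + " MiB still in use (" + retained.size() + " buffers held)");
        }
    }
}
```

If the same call were placed after every N images in an extraction loop, a figure that keeps climbing would point at retained objects, while a flat figure would suggest a single oversized allocation.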
Command lines:
1.8.5:
java -Xmx1g -cp pdfbox-app-1.8.5.jar org.apache.pdfbox.ExtractImages 239665.pdf
2.0_SNAPSHOT:
java -Xmx1g -cp pdfbox-app-2.0.0-SNAPSHOT.jar org.apache.pdfbox.tools.ExtractImages -addkey 239665.pdf
Results:
1.8.5: 906 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.pdmodel.common.PDStream.getByteArray(PDStream.java:514)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:217)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:363)
    at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage.write2file(PDXObjectImage.java:254)
    at org.apache.pdfbox.ExtractImages.processResources(ExtractImages.java:202)
    at org.apache.pdfbox.ExtractImages.extractImages(ExtractImages.java:160)
    at org.apache.pdfbox.ExtractImages.main(ExtractImages.java:65)
{noformat}
2.0_SNAPSHOT: 428 files before OOM
{noformat}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at org.apache.pdfbox.io.IOUtils.copy(IOUtils.java:70)
    at org.apache.pdfbox.io.IOUtils.toByteArray(IOUtils.java:52)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:171)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:154)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:171)
    at org.apache.pdfbox.tools.ExtractImages.write2file(ExtractImages.java:231)
    at org.apache.pdfbox.tools.ExtractImages.processResources(ExtractImages.java:206)
    at org.apache.pdfbox.tools.ExtractImages.extractImages(ExtractImages.java:164)
    at org.apache.pdfbox.tools.ExtractImages.main(ExtractImages.java:69)
{noformat}
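Both traces end in ByteArrayOutputStream.grow, reached from code that reads a whole stream into a byte array (PDStream.getByteArray in 1.8.5, IOUtils.toByteArray in trunk). That frame shows where the allocation finally failed, not necessarily what is holding the heap. As a JDK-only illustration of the growth step itself, the sketch below counts how often the backing array is replaced when no size hint is given. Each replacement goes through Arrays.copyOf, which allocates the new array while the old one is still referenced. GrowthProbe is a hypothetical name and none of this is PDFBox code:

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical probe, not PDFBox code: counts how often the backing array is replaced. */
public class GrowthProbe extends ByteArrayOutputStream {
    private int growths = 0;

    public GrowthProbe() { super(); }
    public GrowthProbe(int initialCapacity) { super(initialCapacity); }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        int before = buf.length;          // 'buf' is the protected backing array
        super.write(b, off, len);
        if (buf.length != before) {
            growths++;                    // a larger array was allocated and the data copied over
        }
    }

    public int growths() { return growths; }
    public int capacity() { return buf.length; }

    /** Writes totalBytes of zeros in 4 KiB chunks and returns the same probe for inspection. */
    public static GrowthProbe fill(GrowthProbe out, int totalBytes) {
        byte[] chunk = new byte[4096];
        for (int written = 0; written < totalBytes; written += chunk.length) {
            out.write(chunk, 0, Math.min(chunk.length, totalBytes - written));
        }
        return out;
    }

    public static void main(String[] args) {
        int size = 1024 * 1024;
        GrowthProbe unsized = fill(new GrowthProbe(), size);
        GrowthProbe presized = fill(new GrowthProbe(size), size);
        System.out.println("unsized:  growths=" + unsized.growths() + " capacity=" + unsized.capacity());
        System.out.println("presized: growths=" + presized.growths() + " capacity=" + presized.capacity());
    }
}
```

A buffer created with the final size never grows, so it avoids the moments at which two copies of the data coexist. Whether that matters here depends on what else is occupying the heap when the copy is attempted.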