Posted to dev@pdfbox.apache.org by "Arnaud Jeansen (Jira)" <ji...@apache.org> on 2020/02/20 09:59:00 UTC
[jira] [Created] (PDFBOX-4781) PDF files with invalid compressed
streams cannot be rendered
Arnaud Jeansen created PDFBOX-4781:
--------------------------------------
Summary: PDF files with invalid compressed streams cannot be rendered
Key: PDFBOX-4781
URL: https://issues.apache.org/jira/browse/PDFBOX-4781
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.18
Reporter: Arnaud Jeansen
I am using pdfbox 2.0.18 to generate thumbnails of PDF files.
The code is basically as follows:
{code:java}
byte[] pdfFile = ...;
float dpi = 72f; // render at 72 DPI
try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
    PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
    // Render the first page (index 0) as an RGB image
    return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
} catch (IOException e) {
    // Error handling
}
{code}
This works fine except for a few PDF files that contain an invalid compressed stream.
Note: These PDF files open fine with a variety of PDF readers and Java libraries. Only pdfbox seems to fail on them.
For those files, I get an error log "FlateFilter: stop reading corrupt stream due to a DataFormatException" *and* an `IOException` with the following stack trace:
{noformat}
Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
... 68 common frames omitted
Caused by: java.util.zip.DataFormatException: invalid distance too far back
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
... 82 common frames omitted
{noformat}
Looking further into `org.apache.pdfbox.filter.FlateFilter`:
* The underlying `DataFormatException` (i.e. broken content that cannot be decompressed while reading the stream) is propagated *only* if nothing could be read from the stream
(see FlateFilter#decompress)
* That `DataFormatException` is then wrapped unconditionally into an `IOException`
(see FlateFilter#decode)
As a hack, swallowing the `DataFormatException` in `FlateFilter#decode` makes things work: I still get the error log, but a thumbnail is correctly generated.
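To make the behaviour I am describing concrete, here is a minimal sketch of "keep whatever was decompressed before the error", written against plain `java.util.zip` only. It is not the actual `FlateFilter` code: the class name `LenientInflateSketch` and the methods `inflateLeniently` and `deflate` are hypothetical names made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration, not PDFBox code.
final class LenientInflateSketch {

    // Inflates zlib data; if the stream turns out to be corrupt after some
    // bytes were already decoded, returns those bytes instead of failing.
    static byte[] inflateLeniently(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    // No progress possible (e.g. truncated input): stop here.
                    break;
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            if (out.size() == 0) {
                // Nothing usable was decoded, so report the failure.
                throw e;
            }
            // Otherwise log and keep the partial result.
            System.err.println("stop reading corrupt stream: " + e.getMessage());
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Helper used only to produce test input for the sketch.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "BT /F1 12 Tf (Hello) Tj ET\n".repeat(50)
                .getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(original);

        // Intact stream: full round trip.
        System.out.println(Arrays.equals(original, inflateLeniently(packed)));

        // Damaged stream (cut in half): only part of the data comes back.
        byte[] damaged = Arrays.copyOf(packed, packed.length / 2);
        System.out.println(inflateLeniently(damaged).length + " of "
                + original.length + " bytes recovered");
    }
}
```

For a page content stream, a partial result like this is often enough to draw most of the page, which would explain why the thumbnail still looks right when the exception is swallowed.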