Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/03/24 10:22:00 UTC
[jira] [Comment Edited] (PDFBOX-4781) PDF files with invalid compressed streams cannot be rendered

    [ https://issues.apache.org/jira/browse/PDFBOX-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065489#comment-17065489 ] 

Tilman Hausherr edited comment on PDFBOX-4781 at 3/24/20, 10:21 AM:
--------------------------------------------------------------------

Wow, that PDF is really broken. Here are the errors from PDF.js:
{noformat}
Warning: Indexing all PDF objects
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 120, 194"
Warning: Invalid stream: "FormatError: Bad FCHECK in flate stream: 72, 195"
Warning: Native JPEG decoding failed -- trying to recover: Error during JPEG image loading
Warning: Unable to decode image: JpegError: JPEG error: SOI not found
...
{noformat}
I suspect that this file was opened with an ordinary text editor, modified, and then saved. Object 9 declares a length of 37, but its actual length is 53.
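
For reference, the "Bad FCHECK" warnings concern the two-byte zlib header: RFC 1950 requires that the header, read as a big-endian 16-bit number (CMF * 256 + FLG), be a multiple of 31. Below is a minimal sketch of that check, applied to the byte pairs from the log. The class and method names are made up for illustration; this is not the PDF.js or PDFBox code.

```java
final class ZlibHeaderCheck {
    /**
     * Returns true if the two zlib header bytes pass the FCHECK test
     * from RFC 1950: (CMF * 256 + FLG) must be a multiple of 31.
     * Both arguments are unsigned byte values (0..255).
     */
    static boolean fcheckOk(int cmf, int flg) {
        return ((cmf << 8) | flg) % 31 == 0;
    }

    public static void main(String[] args) {
        // The two byte pairs reported by PDF.js for the broken file.
        System.out.println(fcheckOk(120, 194)); // false
        System.out.println(fcheckOk(72, 195));  // false
        // 0x78 0x9C, a common header written by zlib at default settings.
        System.out.println(fcheckOk(0x78, 0x9C)); // true
    }
}
```

Incidentally, 194 and 195 are 0xC2 and 0xC3, the lead bytes UTF-8 uses for U+0080 through U+00FF. That would fit a binary stream having been re-saved as text, though this is only a guess.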

There are a lot of other errors, e.g. when opening the images. This is a telecom invoice, yet PDF.js and Chrome (I didn't try Adobe) do not show the company logo (have you ever seen an invoice without a company logo?). The third page is also not shown, although the text indicates there is one.

Changing the Flate filter code so that it returns an empty result would mean that incorrect streams would not be detected by preflight, our PDF/A-1b checker.

I'd prefer that you "hack" PDFBox on your own for that application (thumbnails), or refuse to create thumbnails for a broken PDF, i.e. create an "X" instead, maybe with a message such as "thumbnail could not be created, the PDF may be corrupt or incomplete". The hack would be OK as long as you don't use the jar for anything else.
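
A rough sketch of the second option, assuming the application just needs some image back: if rendering throws an `IOException`, return a placeholder with an "X" instead. `ThumbnailFallback` and `PageRenderer` are hypothetical names, not PDFBox classes; in the real application the lambda would wrap the `renderImageWithDPI` call. The caption is left out here, but could be added with `Graphics2D.drawString`.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.IOException;

final class ThumbnailFallback {
    /** Hypothetical stand-in for the actual rendering call. */
    interface PageRenderer {
        BufferedImage render() throws IOException;
    }

    /** Returns the rendered page, or an "X" placeholder if rendering fails. */
    static BufferedImage renderOrPlaceholder(PageRenderer renderer, int width, int height) {
        try {
            return renderer.render();
        } catch (IOException e) {
            // The PDF may be corrupt or incomplete: do not guess, show an "X".
            return placeholder(width, height);
        }
    }

    /** Draws a red "X" across a white image of the given size. */
    static BufferedImage placeholder(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.RED);
            g.setStroke(new BasicStroke(3f));
            g.drawLine(0, 0, width - 1, height - 1);
            g.drawLine(0, height - 1, width - 1, 0);
        } finally {
            g.dispose();
        }
        return image;
    }
}
```

With this approach PDFBox itself stays unmodified, so the same jar can still be used for validation.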


> PDF files with invalid compressed streams cannot be rendered
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4781
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4781
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.18
>            Reporter: Arnaud Jeansen
>            Priority: Major
>
> I am using pdfbox 2.0.18 to generate thumbnails of PDF files.
> The code is basically as follows
> {code:java}
>     byte[] pdfFile = ...;
>     float dpi = 72f;
>     try (PDDocument pdfDocument = PDDocument.load(new ByteArrayInputStream(pdfFile))) {
>       PDFRenderer pdfRenderer = new PDFRenderer(pdfDocument);
>       return pdfRenderer.renderImageWithDPI(0, dpi, ImageType.RGB);
>     } catch (IOException e) {
>       // Error handling
>     }
> {code}
> This works fine except for a few PDF files with an invalid compressed stream.
> Note: These PDF files open fine with a variety of PDF readers and java libraries. Only pdfbox seems to fail on them.
> For those files, I get an error log "FlateFilter: stop reading corrupt stream due to a DataFormatException" *and* an `IOException` with the following stack trace:
> {noformat}
> Caused by: java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
> 	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:58)
> 	at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
> 	at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:84)
> 	at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
> 	at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
> 	at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:170)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:92)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:499)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
> 	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
> 	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:269)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
> 	at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229)
> 	at com.foocompany.service.PdfImageService.convertFromPdfBinaryToJpegBinary(PdfImageService.java:167)
> 	... 68 common frames omitted
> Caused by: java.util.zip.DataFormatException: invalid distance too far back
> 	at java.util.zip.Inflater.inflateBytes(Native Method)
> 	at java.util.zip.Inflater.inflate(Inflater.java:259)
> 	at java.util.zip.Inflater.inflate(Inflater.java:280)
> 	at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:83)
> 	at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
> 	... 82 common frames omitted
> {noformat}
> Looking further into `org.apache.pdfbox.filter.FlateFilter`:
> * The underlying `DataFormatException` (= broken content that cannot be decompressed when reading the stream) is forwarded up *only* if nothing could be read from this stream (see `FlateFilter#decompress`).
> * The `DataFormatException` is wrapped unconditionally into an `IOException` (see `FlateFilter#decode`).
> As a hack, swallowing `DataFormatException` in `FlateFilter#decode` makes things work. I get an error log but a thumbnail is correctly generated.
> I am not sure how to proceed from here. I am willing to write a patch but I am not sure how to address this issue.
> I can also provide a PDF file that exhibits the problem privately.
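
To illustrate the trade-off discussed in this issue, here is a minimal standalone sketch that uses `java.util.zip.Inflater` directly. It is not the PDFBox `FlateFilter` code; `TolerantInflate` and its `lenient` flag are hypothetical. In lenient mode a `DataFormatException` is swallowed and whatever was decoded so far is returned (the "hack"); in strict mode it is wrapped into an `IOException`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class TolerantInflate {
    /** Inflates zlib data; in lenient mode, corrupt data yields a partial (possibly empty) result. */
    static byte[] inflate(byte[] compressed, boolean lenient) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input, or a preset dictionary is required
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            if (!lenient) {
                throw new IOException(e); // strict: report the corrupt stream
            }
            // lenient: keep whatever was decoded before the error
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper for the demo: compresses data with a zlib header. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[2048];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = deflate("BT /F1 12 Tf (Invoice) Tj ET".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inflate(ok, false), StandardCharsets.UTF_8));

        // Header bytes 120, 194 as in the PDF.js log: the zlib header check fails.
        byte[] bad = {120, (byte) 194, 0, 0};
        System.out.println("lenient bytes: " + inflate(bad, true).length);
        try {
            inflate(bad, false);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

The lenient path is exactly what hides a broken stream from the caller, which is why it would be a problem for a validator that must report such streams.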


