Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/03/28 16:16:32 UTC

[jira] [Commented] (PDFBOX-457) PDF to Image doesn't show correctly the document

    [ https://issues.apache.org/jira/browse/PDFBOX-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950868#comment-13950868 ] 

Tilman Hausherr commented on PDFBOX-457:
----------------------------------------

I had another look at the file and tried to view it with several open source Java products; all of them failed. One closed source product that I won't name succeeded, as did gsview and pdf.js.

I also traced through the filter and noticed that it was working with a wrong length. The stream is encoded twice, and the CCITT filter comes second, yet it is the one that gets the "/Length" value. This makes no sense and no other filter does this, so I deleted it in rev XXX for the trunk and rev XXX for the 1.8 branch. (I will commit this later.)
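
To illustrate why "/Length" is the wrong number for a second filter, here is a minimal sketch. It is not the PDFBox filter code: the class and method names are made up for illustration, the 500 zero bytes are only a stand-in for CCITT data, and the second stage is an identity placeholder rather than a real CCITT decoder. The only point is that "/Length" describes the bytes as stored in the file, while the second filter sees whatever the first filter produced.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not PDFBox code.
class FilterChainSketch {

    // Stage 1: a real Flate decode using java.util.zip.Inflater.
    static byte[] inflate(byte[] in) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input, stop instead of looping
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Helper used only to build test input for the demo.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Applies the filters in order; each stage consumes the whole output of the previous one.
    static byte[] decodeChain(byte[] encoded, List<UnaryOperator<byte[]>> filters) {
        byte[] data = encoded;
        for (UnaryOperator<byte[]> filter : filters) {
            data = filter.apply(data);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] intermediate = new byte[500];       // stand-in for the CCITT-encoded bytes
        byte[] stored = deflate(intermediate);     // what is actually stored in the file
        int lengthEntry = stored.length;           // this is what "/Length" would describe

        List<UnaryOperator<byte[]>> chain = new ArrayList<>();
        chain.add(FilterChainSketch::inflate);     // first filter
        chain.add(UnaryOperator.identity());       // placeholder for the second (CCITT) filter

        byte[] result = decodeChain(stored, chain);
        System.out.println("/Length = " + lengthEntry
                + ", bytes reaching the second filter = " + result.length);
    }
}
```

In this sketch the second stage simply reads its input to the end, so it never needs a length from the stream dictionary.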

However, that wasn't the cause of the crash. I then compared the CCITT-encoded content with what I get when converting the PDF to PS and back to PDF with gsview, and it looked like there were extra bytes at the beginning. After some experimenting, I was able to decode the CCITT stream in the debugger by positioning past these 6 bytes: " 5d7&¡" (20 35 64 37 26 A1). That is not a solution. To rule out a Java bug in "Inflate", I tested the original ZLIB library and got the same 6 bytes. I built a TIFF from scratch containing the CCITT G4 stream and could not display it with IrfanView (which uses libtiff), but could after removing the 6 bytes. The same happened with JAI when I tried to read the TIFF created with the 6 bytes and save it.
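
For anyone who wants to repeat the TIFF experiment, the following is a rough sketch of how a raw G4 stream can be wrapped in a minimal single-strip TIFF, with an optional step that drops the 6 bytes if they are present. This is not the code I used: the class and method names are hypothetical, the width and height have to be taken from the image dictionary of the PDF, and the photometric value (WhiteIsZero) is an assumption that may need to be flipped depending on "/BlackIs1".

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical sketch: wraps raw CCITT G4 data in a minimal little-endian TIFF.
class G4TiffSketch {

    // The six unexpected leading bytes described above.
    static final byte[] EXTRA = {0x20, 0x35, 0x64, 0x37, 0x26, (byte) 0xA1};

    // Returns data without the prefix if it starts with it, otherwise returns data unchanged.
    static byte[] stripLeading(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return data;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return data;
            }
        }
        return Arrays.copyOfRange(data, prefix.length, data.length);
    }

    // Writes one 12-byte IFD entry with a count of 1; type 3 = SHORT, type 4 = LONG.
    static void entry(ByteBuffer b, int tag, int type, int value) {
        b.putShort((short) tag).putShort((short) type).putInt(1);
        if (type == 3) {
            b.putShort((short) value).putShort((short) 0); // SHORT is left-justified in the field
        } else {
            b.putInt(value);
        }
    }

    static byte[] wrapG4(byte[] g4, int width, int height) {
        final int entries = 8;
        int ifdSize = 2 + entries * 12 + 4;   // entry count + entries + next-IFD offset
        int dataOffset = 8 + ifdSize;         // strip data follows header and IFD
        ByteBuffer b = ByteBuffer.allocate(dataOffset + g4.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        b.put((byte) 'I').put((byte) 'I').putShort((short) 42).putInt(8); // header, IFD at offset 8
        b.putShort((short) entries);
        entry(b, 256, 4, width);       // ImageWidth
        entry(b, 257, 4, height);      // ImageLength
        entry(b, 258, 3, 1);           // BitsPerSample = 1
        entry(b, 259, 3, 4);           // Compression = 4 (CCITT Group 4)
        entry(b, 262, 3, 0);           // PhotometricInterpretation = WhiteIsZero (assumption)
        entry(b, 273, 4, dataOffset);  // StripOffsets
        entry(b, 278, 4, height);      // RowsPerStrip = whole image in one strip
        entry(b, 279, 4, g4.length);   // StripByteCounts
        b.putInt(0);                   // no further IFD
        b.put(g4);
        return b.array();
    }

    public static void main(String[] args) {
        byte[] fromPdf = {0x20, 0x35, 0x64, 0x37, 0x26, (byte) 0xA1, 0x01, 0x02}; // dummy data only
        byte[] tiff = wrapG4(stripLeading(fromPdf, EXTRA), 8, 1);
        System.out.println("TIFF size: " + tiff.length + " bytes");
    }
}
```

Writing the file once with and once without the call to stripLeading gives the two variants that can be compared in a TIFF viewer.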

So it is very mysterious. I looked at the source of pdf.js and couldn't see any retry logic.

https://github.com/mozilla/pdf.js/blob/master/src/core/parser.js
https://github.com/mozilla/pdf.js/blob/master/src/core/stream.js

That is an interesting piece of code: it handles all three CCITT compressions with the same class. The code for GS is also available and is similarly complex.

> PDF to Image doesn't show correctly the document
> ------------------------------------------------
>
>                 Key: PDFBOX-457
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-457
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 0.8.0-incubator
>            Reporter: Marcelo Tavares
>            Assignee: Daniel Wilson
>              Labels: CCITTFaxDecode, TIFF, ccitt
>         Attachments: 580505.PR00003.000003.PDF, pdfbox-457-Scan_from_a_Xerox_WorkCentre_Pro.PDF, pdfbox-457-as_fax.pdf, pdfbox-457.PNG, testPDFToImage1.png
>
>
> I tried to convert the following document to an image, but I got the attached result.
> It rendered just the text. I also tried different formats like JPG. I ran it using the PDFToImage class, passing the document path as a parameter.
> I've read that sometimes a document is not created in accordance with the PDF standard. But is there a possibility to ignore that? In fact, it's very important to me, so could I use PDFBox despite those "errors"?
> Thank you
> Marcelo



--
This message was sent by Atlassian JIRA
(v6.2#6252)