Posted to dev@pdfbox.apache.org by "Josh Nankin (Created) (JIRA)" <ji...@apache.org> on 2012/02/15 22:36:59 UTC

[jira] [Created] (PDFBOX-1228) PDocument corrupts file

PDocument corrupts file
-----------------------

                 Key: PDFBOX-1228
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1228
             Project: PDFBox
          Issue Type: Bug
          Components: PDModel
    Affects Versions: 1.6.0, 1.7.0
         Environment: Ubuntu 10.04 amd64
            Reporter: Josh Nankin
             Fix For: 1.7.0, 1.6.0
         Attachments: in.pdf

I have a file (attached) that, when loaded with PDDocument.load and then saved to another location, simply saves as a blank PDF.  The number of pages is correct, but when opened in Acrobat, all the page names are corrupted and the pages are blank.

Here's the code:

        PDDocument doc = PDDocument.load("/home/jnankin/Desktop/in.pdf");
        doc.save("/home/jnankin/Desktop/out.pdf");
        doc.close();

Please advise.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (PDFBOX-1228) PDocument corrupts file

Posted by "Andreas Lehmkühler (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219490#comment-13219490 ] 

Andreas Lehmkühler commented on PDFBOX-1228:
--------------------------------------------

@Josh

1. Encrypted PDFs don't have to be password-protected.

2. Comparing PDFBox and ghostscript is like comparing apples and oranges. gs is an end-user tool, while PDFBox is a library to be used within applications. So you don't have to know the PDF spec inside out to use PDFBox, but some of the magic has to be implemented by yourself. Sorry, but a simple load + save isn't enough. Have a look at [1] to see how to decrypt an already loaded PDF using just a couple of lines of code.

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java
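
Point 1 can be checked without PDFBox at all. An encrypted PDF references its encryption dictionary through an /Encrypt entry in the trailer, and that entry is present even when the file opens without a password prompt. The following is a minimal sketch of a heuristic pre-check, not a parser: it only searches the raw bytes for the name "/Encrypt", so false positives are possible. The class and method names are hypothetical and are not part of PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
public class EncryptKeyCheck {

    /** Returns true if the raw PDF bytes contain the name "/Encrypt". */
    public static boolean mentionsEncrypt(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data
        // is not mangled when it is converted to a String for searching.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java EncryptKeyCheck in.pdf
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsEncrypt(bytes)
                ? "/Encrypt key found, file is probably encrypted"
                : "no /Encrypt key found");
    }
}
```

A reliable answer still has to come from a real parser, for example by asking the loaded PDDocument whether it is encrypted.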

                

       

[jira] [Closed] (PDFBOX-1228) PDocument corrupts file

Posted by "Andreas Lehmkühler (Closed JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1228.
--------------------------------------

    Resolution: Not A Problem
      Assignee: Andreas Lehmkühler

As George and others on the mailing list already mentioned, the PDF is encrypted and the content gets lost if it isn't decrypted first.

This is neither a problem nor a missing feature. If you want to access such a file, you have to decrypt it. If you don't want to do anything, just leave it alone.

Nevertheless, thanks for the comments.
                

       

[jira] [Updated] (PDFBOX-1228) PDocument corrupts file

Posted by "Josh Nankin (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Nankin updated PDFBOX-1228:
--------------------------------

    Attachment: in.pdf
    

        

[jira] [Updated] (PDFBOX-1228) PDocument corrupts file

Posted by "Josh Nankin (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Nankin updated PDFBOX-1228:
--------------------------------

    Priority: Critical  (was: Major)
    

        

[jira] [Updated] (PDFBOX-1228) PDocument corrupts file

Posted by "Andreas Lehmkühler (Updated JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-1228:
---------------------------------------

    Fix Version/s:     (was: 1.7.0)
                       (was: 1.6.0)
    

       

[jira] [Commented] (PDFBOX-1228) PDocument corrupts file

Posted by "Josh Nankin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219384#comment-13219384 ] 

Josh Nankin commented on PDFBOX-1228:
-------------------------------------

Are we talking about the same document?  I can open this document without entering any password and can view everything normally.  Are you referring to another type of encryption?  Additionally, if I run the file through ghostscript first, save it to another location, and then load it into a PDDocument, this problem goes away.

I'm not a PDF expert by any means, but if ghostscript can manipulate this file without corrupting it, I would expect PDFBox to have the same behavior without the end user (developer) having to worry about some sort of non-visible encryption.  Someone who knows nothing about PDFs should be able to use this library right out of the box (no pun intended :))

 
                

       

[jira] [Commented] (PDFBOX-1228) PDocument corrupts file

Posted by "Thomas Chojecki (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219921#comment-13219921 ] 

Thomas Chojecki commented on PDFBOX-1228:
-----------------------------------------

I haven't tested it, but what happens if you just add the encryption password as George posted? As <owner password> just use an empty string, so that it would look like this: new StandardDecryptionMaterial("").

If this helps, you can ask the PDDocument whether it is encrypted and try to open it with the empty string. If this fails, throw an exception and inform the user that the document is protected and could not be decrypted.

On the other hand, PDFBox shouldn't destroy documents if the user calls the save method on an encrypted document without decrypting it. I would prefer to open the issue and add a check while saving a document. Better to throw an exception and inform the user that something is going wrong while saving, so that he has a chance to react (e.g. in unit tests).
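
The approach described in this comment could be sketched roughly as follows. This is a minimal, untested sketch that assumes the PDFBox 1.x API discussed in this thread (PDDocument.isEncrypted, PDDocument.openProtection, StandardDecryptionMaterial) and needs the PDFBox jar on the classpath. The class name and the copy helper are hypothetical, and whether the saved copy ends up unencrypted may depend on the PDFBox version.

```java
import java.io.IOException;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;

// Hypothetical helper for illustration; not part of PDFBox.
public class CopyEncryptedPdf {

    public static void copy(String in, String out)
            throws IOException, COSVisitorException {
        PDDocument doc = PDDocument.load(in);
        try {
            if (doc.isEncrypted()) {
                try {
                    // Try the empty string first: a file that opens without
                    // a password prompt typically accepts it.
                    doc.openProtection(new StandardDecryptionMaterial(""));
                } catch (BadSecurityHandlerException e) {
                    throw new IOException("Unsupported security handler: " + e.getMessage());
                } catch (CryptographyException e) {
                    // Tell the caller instead of silently writing a broken copy.
                    throw new IOException(
                            "Document is protected and could not be decrypted: " + e.getMessage());
                }
            }
            doc.save(out);
        } finally {
            doc.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java CopyEncryptedPdf in.pdf out.pdf
        copy(args[0], args[1]);
    }
}
```

Failing early with an exception gives the caller the chance to react that the comment asks for, rather than discovering a blank output file later.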

                

       

[jira] [Commented] (PDFBOX-1228) PDocument corrupts file

Posted by "George Kalpakas (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218964#comment-13218964 ] 

George Kalpakas commented on PDFBOX-1228:
-----------------------------------------

Hi, 

I don't know if this helps at all, but I noticed that the described problem is caused by the document's being encrypted with an owner password. If (somehow) one was able to remove the protection, then the saved document would be an exact copy of the original. 

So, I guess it is good as it is, so that <PDDocument.save(...)> won't decrypt a password-protected document. (One could have invoked something like <doc.openProtection(new StandardDecryptionMaterial("<owner_pass>"))> before saving the document in order to decrypt it.)
On the other hand, I think it would be "nice" if the user were somehow informed (by means of an Exception?) that he is trying to save an encrypted document without decrypting it first (unless there are cases where it is desirable for the user to do so; nothing comes to mind, but being new to all this PDF stuff, that doesn't say much).

Well, that was it. 
I hope my remarks will be of some use to someone more knowledgeable !

GK
                

        

[jira] [Commented] (PDFBOX-1228) PDocument corrupts file

Posted by "Josh Nankin (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219386#comment-13219386 ] 

Josh Nankin commented on PDFBOX-1228:
-------------------------------------

Here is some code that takes the document attached to this ticket and resizes it without fail (or decryption) with ghostscript:

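        // Assumes resizedFilename, inputFilename and log are defined by the surrounding code.
        long start = System.currentTimeMillis();
        // Pass the arguments as an array so that paths containing spaces are not split.
        Process pr = Runtime.getRuntime().exec(new String[] {
            "gs", "-dQUIET", "-dNOPAUSE", "-dBATCH", "-sPAPERSIZE=letter", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + resizedFilename, "-dPDFFitPage", inputFilename });
        pr.waitFor();
        log.info("Resize complete.  Took " + (System.currentTimeMillis() - start) + " ms");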
                