
Posted to dev@tika.apache.org by "Sascha Szott (JIRA)" <ji...@apache.org> on 2009/08/11 01:28:14 UTC

[jira] Created: (TIKA-267) encrypted files aren't handled properly

encrypted files aren't handled properly
---------------------------------------

                 Key: TIKA-267
                 URL: https://issues.apache.org/jira/browse/TIKA-267
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
         Environment: Ubuntu Linux 8.10, JRE 1.5
            Reporter: Sascha Szott
            Priority: Critical


While extracting full text from a batch of PDF documents, I noticed odd behaviour in Tika when it processes encrypted documents (that is, documents that restrict specific actions such as editing or printing). Extracting content from an encrypted PDF document does not always require decrypting it: when creating an (encrypted) PDF document, the author can allow content extraction without a password. Unfortunately, Tika's PDF parser is not aware of this at the moment. I therefore suggest a minor change to the parse method of org.apache.tika.parser.pdf.PDFParser: add a check ("is copying allowed") before trying to decrypt the document.

To be more precise, I'll provide a code snippet:

public void parse(...) throws ... {
  PDDocument pdfDocument = PDDocument.load(stream);
  try {
    // Decrypt the document only if content extraction is not already allowed
    if (!pdfDocument.getCurrentAccessPermission().canExtractContent()
        && pdfDocument.isEncrypted()) {
      try {
        // Try the empty password
        pdfDocument.decrypt("");
      } catch (Exception e) {
        // Ignore
      }
    }
    ...
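
The condition in the snippet above can be isolated as a pure function, which makes the four possible cases easy to check. This is only a sketch: DecryptDecision and shouldDecrypt are hypothetical names, not part of Tika or PDFBox, and the two booleans merely stand in for pdfDocument.isEncrypted() and pdfDocument.getCurrentAccessPermission().canExtractContent().

```java
/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
public final class DecryptDecision {
    private DecryptDecision() {}

    /**
     * @param encrypted         stands in for pdfDocument.isEncrypted()
     * @param canExtractContent stands in for
     *        pdfDocument.getCurrentAccessPermission().canExtractContent()
     * @return true only when a decrypt("") attempt would be made
     */
    public static boolean shouldDecrypt(boolean encrypted, boolean canExtractContent) {
        // Decrypt only an encrypted document that does not already allow extraction.
        return encrypted && !canExtractContent;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Print the decision for every combination of the two inputs.
        for (boolean encrypted : flags) {
            for (boolean canExtract : flags) {
                System.out.println("encrypted=" + encrypted
                        + " canExtractContent=" + canExtract
                        + " -> decrypt=" + shouldDecrypt(encrypted, canExtract));
            }
        }
    }
}
```

Only the combination "encrypted and extraction not allowed" leads to a decrypt attempt; an encrypted document that permits extraction is left untouched.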

Another solution would be to remove the "isEncrypted" check altogether, since PDFBox seems to extract content from encrypted documents correctly (and throws an IOException on failure).
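
A minimal sketch of that alternative, under stated assumptions: TextSource is a hypothetical stand-in for a loaded document (it is not a PDFBox type), and extractOrNull is a made-up helper name. The sketch attempts extraction directly and treats the IOException as the only failure signal; a real parser would more likely propagate or wrap the exception than return null.

```java
import java.io.IOException;

/** Hypothetical illustration; none of these types belong to Tika or PDFBox. */
public final class ExtractWithoutCheck {
    private ExtractWithoutCheck() {}

    /** Stand-in for a loaded document whose text extraction may fail. */
    public interface TextSource {
        String extractText() throws IOException;
    }

    /**
     * Attempts extraction without any prior encryption check.
     * Returns the extracted text, or null if extraction failed.
     */
    public static String extractOrNull(TextSource source) {
        try {
            return source.extractText();
        } catch (IOException e) {
            // Failure is reported by the extraction itself, e.g. when a password is required.
            return null;
        }
    }

    public static void main(String[] args) {
        TextSource readable = () -> "some text";
        TextSource locked = () -> { throw new IOException("password required"); };
        System.out.println(extractOrNull(readable));
        System.out.println(extractOrNull(locked));
    }
}
```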

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-267) encrypted pdf files aren't handled properly

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-267.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Good point, thanks! Fixed as suggested in revision 806888.

> encrypted pdf files aren't handled properly
> -------------------------------------------
>
>                 Key: TIKA-267
>                 URL: https://issues.apache.org/jira/browse/TIKA-267
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux 8.10, JRE 1.5
>            Reporter: Sascha Szott
>            Assignee: Jukka Zitting
>            Priority: Critical
>             Fix For: 0.5
>
>   Original Estimate: 0.08h
>  Remaining Estimate: 0.08h


[jira] Updated: (TIKA-267) encrypted pdf files aren't handled properly

Posted by "Sascha Szott (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sascha Szott updated TIKA-267:
------------------------------

    Summary: encrypted pdf files aren't handled properly  (was: encrypted files aren't handled properly)

