Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2009/02/06 14:13:59 UTC
[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password
[ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671117#action_12671117 ]
Andrzej Bialecki commented on NUTCH-643:
-----------------------------------------
Fixed in rev. 741558, using the CVS HEAD version of PDFBox 0.7.4 from SourceForge. While testing documents that contain images, I discovered that the JAI libraries must be added as well, which unfortunately increases the size of the plugin.
> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
> Key: NUTCH-643
> URL: https://issues.apache.org/jira/browse/NUTCH-643
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.0.0
> Environment: This problem affects the current trunk too.
> Reporter: Guillaume Smet
> Assignee: Andrzej Bialecki
> Fix For: 1.0.0
>
> Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.
> This behaviour is implemented with the following code:
>     if (pdf.isEncrypted()) {
>       DocumentEncryption decryptor = new DocumentEncryption(pdf);
>       // Just try using the default password and move on
>       decryptor.decryptDocument("");
>     }
> This code uses a deprecated API. Moreover, there seems to be a bug in that deprecated PDFBox API: it throws a ClassCastException, as shown by the following error:
> 2008-08-07 19:15:56,860 WARN parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> With the new security API, this document parses without any error and we can get its content:
>     if (pdf.isEncrypted()) {
>       // Just try using the default password and move on
>       pdf.openProtection(new StandardDecryptionMaterial(""));
>     }
> I attached a patch that fixes this problem: it works perfectly with the above document and gets rid of the deprecated API.
> Regards,
> --
> Guillaume
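
The "try the empty password and move on" behaviour described in the quoted report can be illustrated without PDFBox on the classpath. The following is only a minimal sketch under stated assumptions: EncryptedDocument, FakeDocument and unlockWithEmptyPassword are hypothetical names invented for this illustration, not PDFBox or Nutch classes, and openProtection here takes a plain String rather than the StandardDecryptionMaterial used in the patch. Catching the failure and returning false is also an assumption about how a caller might want to skip documents that need a real password; the report does not say how the patched parser handles that case.

```java
/**
 * Sketch of "try the empty password, then move on" for encrypted documents.
 * EncryptedDocument and FakeDocument are hypothetical stand-ins, not PDFBox classes.
 */
final class EmptyPasswordSketch {

    /** Hypothetical view of the two document operations the report relies on. */
    interface EncryptedDocument {
        boolean isEncrypted();

        /** Unlocks the document, or throws if the password is rejected. */
        void openProtection(String password) throws Exception;
    }

    /**
     * Returns true when the document is readable: either it was never
     * encrypted, or the empty password unlocked it.
     */
    static boolean unlockWithEmptyPassword(EncryptedDocument doc) {
        if (!doc.isEncrypted()) {
            return true; // nothing to decrypt
        }
        try {
            doc.openProtection(""); // just try the default (empty) password
            return true;
        } catch (Exception e) {
            return false; // a real password is required; the caller can skip the document
        }
    }

    /** Toy document for the demo: a null password means "not encrypted". */
    static final class FakeDocument implements EncryptedDocument {
        private final String password;

        FakeDocument(String password) {
            this.password = password;
        }

        @Override
        public boolean isEncrypted() {
            return password != null;
        }

        @Override
        public void openProtection(String candidate) throws Exception {
            if (!password.equals(candidate)) {
                throw new Exception("wrong password");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(unlockWithEmptyPassword(new FakeDocument(null)));     // true
        System.out.println(unlockWithEmptyPassword(new FakeDocument("")));       // true
        System.out.println(unlockWithEmptyPassword(new FakeDocument("secret"))); // false
    }
}
```

In the real parser the same guard would wrap the document object returned by PDFBox, as in the snippet from the report; the fake document exists only so the control flow can be run and checked in isolation.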