Posted to dev@nutch.apache.org by "Guillaume Smet (JIRA)" <ji...@apache.org> on 2008/08/08 15:49:44 UTC

[jira] Created: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

ClassCastException in PdfParser on encrypted PDF with empty password
--------------------------------------------------------------------

                 Key: NUTCH-643
                 URL: https://issues.apache.org/jira/browse/NUTCH-643
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: This problem affects the current trunk too.
            Reporter: Guillaume Smet


Hi,

If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.

This behaviour is implemented with the following code:
      if (pdf.isEncrypted()) {
        DocumentEncryption decryptor = new DocumentEncryption(pdf);
        //Just try using the default password and move on
        decryptor.decryptDocument("");
      }
This code relies on a deprecated API, and that API appears to have a bug in PDFBox: it throws a ClassCastException, as the following log shows:

2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
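
For readers less familiar with this failure mode, here is a minimal, self-contained sketch of how such a ClassCastException arises. The classes are hypothetical stand-ins, not the PDFBox classes, and the sketch assumes the object actually stored is of the general type while the caller downcasts unconditionally to the more specific one.

```java
public class DowncastSketch {
    // Hypothetical stand-ins for a general and a more specific encryption dictionary.
    static class EncryptionDictionary {}
    static class StandardEncryption extends EncryptionDictionary {}

    // Unconditional downcast: throws ClassCastException when the
    // runtime type is only the general type.
    static StandardEncryption unsafe(EncryptionDictionary dict) {
        return (StandardEncryption) dict;
    }

    // Guarded downcast: returns null instead of throwing.
    static StandardEncryption guarded(EncryptionDictionary dict) {
        return (dict instanceof StandardEncryption) ? (StandardEncryption) dict : null;
    }

    public static void main(String[] args) {
        EncryptionDictionary plain = new EncryptionDictionary();
        try {
            unsafe(plain);
        } catch (ClassCastException e) {
            System.out.println("unsafe cast failed: " + e.getClass().getName());
        }
        System.out.println("guarded cast returned null: " + (guarded(plain) == null));
    }
}
```

The guard only illustrates the mechanism; it is not a claim about how PDFBox fixed the problem.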

With the new security API, the document parses without error and its content can be extracted:
      if (pdf.isEncrypted()) {
        // Just try using the default password and move on
        pdf.openProtection(new StandardDecryptionMaterial(""));
      }
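
The "try the empty password and move on" flow can also be sketched without PDFBox on the classpath. Everything below (ProtectedDocument, FakeDocument, BadPasswordException, tryEmptyPassword) is a hypothetical stand-in invented for illustration; it only models the control flow and assumes that a rejected password should be reported to the caller rather than left to propagate.

```java
public class EmptyPasswordSketch {
    // Hypothetical exception for a rejected password.
    static class BadPasswordException extends Exception {}

    // Hypothetical minimal view of a PDF document; not a PDFBox type.
    interface ProtectedDocument {
        boolean isEncrypted();
        void open(String password) throws BadPasswordException;
    }

    // In-memory stand-in: a null password means "not encrypted".
    static class FakeDocument implements ProtectedDocument {
        private final String password;
        FakeDocument(String password) { this.password = password; }
        public boolean isEncrypted() { return password != null; }
        public void open(String candidate) throws BadPasswordException {
            if (!password.equals(candidate)) throw new BadPasswordException();
        }
    }

    // Returns true when the document is readable afterwards,
    // false when the empty password was rejected.
    static boolean tryEmptyPassword(ProtectedDocument doc) {
        if (!doc.isEncrypted()) {
            return true;
        }
        try {
            doc.open(""); // just try the default password and move on
            return true;
        } catch (BadPasswordException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryEmptyPassword(new FakeDocument("")));
        System.out.println(tryEmptyPassword(new FakeDocument("secret")));
        System.out.println(tryEmptyPassword(new FakeDocument(null)));
    }
}
```

Returning a boolean lets the caller skip an unreadable document cleanly instead of handling a generic exception.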

I have attached a patch that fixes this problem: it works perfectly with the above document and gets rid of the deprecated API.

Regards,

-- 
Guillaume

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Guillaume Smet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668010#action_12668010 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Doğacan,

The problem isn't the license of PDFBox, which is already included in Nutch. It's more that PDFBox is on its way to becoming an Apache project (it's in the incubator - see http://incubator.apache.org/pdfbox/), and it seems that you can't include a library which is in the incubator.

So you can either wait for PDFBox to become a full Apache project, or build a development version from the latest PDFBox tree on sourceforge.net, which is what I did (the problem is fixed in the sf.net tree). But you then have a development version in the Nutch tree rather than a stable release, and I'm not sure that's acceptable.

It's more a problem of release policy and release rules than a technical or license problem.

-- 
Guillaume

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Closed: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-643.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki 

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668021#action_12668021 ] 

Doğacan Güney commented on NUTCH-643:
-------------------------------------

Right, we should update Tika to 0.2 (post-incubation) too before releasing 1.0 :) I actually would have done that a while back, but I know nothing about Tika, so I was worried about breaking stuff.


> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668001#action_12668001 ] 

Doğacan Güney commented on NUTCH-643:
-------------------------------------

So... Can we commit this patch and PDFBox? It seems PDFBox is released under a BSD license. Is it compatible with the ASF license?

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Guillaume Smet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620955#action_12620955 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

In fact, the problem is more complex than an API problem, and it is solved in the current PDFBox trunk (from the Apache incubator). I used revision 683874.

I made the following changes:
- upgrade from FontBox-0.1-dev to FontBox-0.2-dev (shipped in PDFBox lib/ directory)
- upgrade from PDFBox-0.7.3 to PDFBox-0.7.4-dev (rev: 683874)
- copy bcprov-jdk14-132.jar, bcmail-jdk14-132.jar and their license to the parse-pdf lib/ directory: the license seems to be compatible with the Apache license (I took the jars from PDFBox trunk)
- fix the deprecation issues in PdfParser

I had a lot of errors indexing a bunch of PDF files from several websites. After this upgrade, it's far far better: I don't have any ClassCastException issues in PDFBox anymore (they fixed them in the current trunk, for example see this patch from Feb 2007: http://pdfbox.cvs.sourceforge.net/pdfbox/pdfbox/src/org/pdfbox/filter/FlateFilter.java?r1=1.10&r2=1.11 ).

Patch attached. The patch doesn't contain the jars but they are referenced in the patch for completeness. I can add them if needed.

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671117#action_12671117 ] 

Andrzej Bialecki  commented on NUTCH-643:
-----------------------------------------

Fixed in rev. 741558, using CVS HEAD version of PDFBox 0.7.4 from SourceForge. During tests on documents containing images I discovered that it's necessary to add JAI libraries too - this unfortunately increased the size of the plugin.

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Updated: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Guillaume Smet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Smet updated NUTCH-643:
---------------------------------

    Affects Version/s:     (was: 0.9.0)
                       1.0.0

> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Guillaume Smet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623572#action_12623572 ] 

Guillaume Smet commented on NUTCH-643:
--------------------------------------

Hi Andrzej,

This problem is also fixed in the non-Apache repository of PDFBox (on sf.net - the link I posted is from the sf.net CVS tree). I don't know, though, whether you can build and ship an unreleased version of PDFBox under the ASF release policy.

Even if we can't solve it in the Nutch tree right now, the problem is now referenced and people can solve it by themselves quite easily.

Regards,

-- 
Guillaume



> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>         Attachments: parse-pdf-PDFBox_upgrade.diff
>
>
> Hi,
> If a PDF document is encrypted with an empty password, the PdfParser should decrypt it using the empty password.
> This behaviour is implemented with the following code:
>       if (pdf.isEncrypted()) {
>         DocumentEncryption decryptor = new DocumentEncryption(pdf);
>         //Just try using the default password and move on
>         decryptor.decryptDocument("");
>       }
> It uses a deprecated API and moreover it seems there is a bug in PDFBox in this deprecated API (we have a ClassCastException in PDFBox) as we have the following error:
> 2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
> 2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
> 2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
> With the new security API, this document parses without error and we can get its content:
>       if (pdf.isEncrypted()) {
>         // Just try using the default password and move on
>         pdf.openProtection(new StandardDecryptionMaterial(""));
>       }
> I attached the patch fixing this problem: it works perfectly with the above document and gets rid of the deprecated API.
> Regards,
> -- 
> Guillaume
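
The ClassCastException in the trace above is what Java raises on an unchecked downcast when the runtime object is not an instance of the target type. The following is a minimal sketch of that failure mode only. The classes EncryptionDictionary and StandardEncryption are hypothetical stand-ins, not PDFBox's real classes, and the subclass relationship between them is an assumption made for illustration; the thread does not show PDFBox's actual hierarchy.

```java
// Hypothetical stand-ins for illustration; these are NOT PDFBox's real classes.
class CastSketch {
    static class EncryptionDictionary {}
    static class StandardEncryption extends EncryptionDictionary {}

    // Unchecked downcast: throws ClassCastException when the runtime
    // object is only the base type.
    static StandardEncryption unsafe(EncryptionDictionary d) {
        return (StandardEncryption) d;
    }

    // Guarded variant: returns null instead of throwing.
    static StandardEncryption safe(EncryptionDictionary d) {
        return (d instanceof StandardEncryption) ? (StandardEncryption) d : null;
    }

    public static void main(String[] args) {
        EncryptionDictionary base = new EncryptionDictionary();
        try {
            unsafe(base);
        } catch (ClassCastException e) {
            System.out.println("downcast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("guarded result is null: " + (safe(base) == null));
    }
}
```

A caller cannot add such a guard inside a library it does not control, which is presumably why the report switches to a different API entry point rather than working around the cast.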

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671407#action_12671407 ] 

Hudson commented on NUTCH-643:
------------------------------

Integrated in Nutch-trunk #717 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/717/])
     ClassCastException in PDF parser, upgrade to unofficial PDFBox 0.7.4


> ClassCastException in PdfParser on encrypted PDF with empty password
> --------------------------------------------------------------------
>
>                 Key: NUTCH-643
>                 URL: https://issues.apache.org/jira/browse/NUTCH-643
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: This problem affects the current trunk too.
>            Reporter: Guillaume Smet
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: parse-pdf-PDFBox_upgrade.diff


[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623519#action_12623519 ] 

Andrzej Bialecki  commented on NUTCH-643:
-----------------------------------------

AFAIK we can't include libraries from projects undergoing incubation, because their legal status is not yet fully confirmed by the ASF. I think we have to either wait until PDFBox comes out of incubation or use the latest non-Apache version (which unfortunately doesn't yet address this problem).



[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668015#action_12668015 ] 

Andrzej Bialecki  commented on NUTCH-643:
-----------------------------------------

+1. Yes, it's compatible.



[jira] Commented: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668018#action_12668018 ] 

Andrzej Bialecki  commented on NUTCH-643:
-----------------------------------------

(Sorry Guillaume, I missed your comment.) There is an existing precedent in the Nutch source tree, namely the Tika library, which is still incubating. This practice is, however, frowned upon ;) I'm OK with using the latest SF.net version of PDFBox built from sources, provided we include a notice about the SVN revision of the library. This is probably better than using the version from the incubator and making the legal situation worse.



[jira] Updated: (NUTCH-643) ClassCastException in PdfParser on encrypted PDF with empty password

Posted by "Guillaume Smet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Smet updated NUTCH-643:
---------------------------------

    Attachment: parse-pdf-PDFBox_upgrade.diff
