You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@tika.apache.org by Peter Bowyer <pe...@mapledesign.co.uk> on 2014/12/11 18:42:51 UTC

Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs



PDF Testcases pass, but fail on my own encrypted PDF (sample file at
https://dl.dropboxusercontent.com/u/2460167/encryption.pdf. Its password is
'testing123')

To rule out a problem with the PDF I tested with Xpdf, and pdftotext
extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6
@updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6
@updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable
to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException:
Input length must be multiple of 16 when decrypting with padded cipher
        at
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at
org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at
org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at
org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at
org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be
multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at
com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at
javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more




I searched the pdfbox issue tracker and found
https://issues.apache.org/jira/browse/PDFBOX-2469 and
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to
related issues. The ticket status says a number of these issues are fixed
in the 1.8.8 snapshot, and if you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set
<pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edit
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
and enable the non-sequential parser.

Now tika won't build. I change PDFParser.properties back and it won't build
either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec
<<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest):
Non-Sequential Parser failed on test file
/root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to
extract PDF content


System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4;
2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family:
"unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts
I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter

RE: Encrypted PDF issues & build issues

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Upgrade just made in Tika trunk.  The integration required more than changing the one test…Sorry about that!

Let us know if there are any surprises with the upgrade.

From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Thursday, December 11, 2014 2:41 PM
To: user@tika.apache.org
Subject: RE: Encrypted PDF issues & build issues

Y, sorry.  As you point out, that should be fixed in PDFBox 1.8.8.  A vote was just taken for that, so that will be out very soon.  Last I looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test (I think?) in Tika…which is why you’re getting a failed build.  Your error message is not what I was getting, but it was in that test.

In short…by early next week (I hope), Tika trunk will be good to go with PDFBox 1.8.8.

If you’d like the one or two lines of code to change to get a Tika to build with 1.8.8-SNAPSHOT, let me know.

Best,

           Tim

From: Peter Bowyer [mailto:peter@mapledesign.co.uk]
Sent: Thursday, December 11, 2014 12:43 PM
To: user@tika.apache.org<ma...@tika.apache.org>
Subject: Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs

PDF Testcases pass, but fail on my own encrypted PDF (sample file at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf. Its password is 'testing123')

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6 @updates

Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more

I searched the pdfbox issue tracker and found https://issues.apache.org/jira/browse/PDFBOX-2469 and https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to related issues. The ticket status says a number of these issues are fixed in the 1.8.8 snapshot, and if you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set <pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edit `tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties` and enable the non-sequential parser.

Now tika won't build. I change PDFParser.properties back and it won't build either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content

System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter

RE: Encrypted PDF issues & build issues

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Y, sorry.  As you point out, that should be fixed in PDFBox 1.8.8.  A vote was just taken for that, so that will be out very soon.  Last I looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test (I think?) in Tika…which is why you’re getting a failed build.  Your error message is not what I was getting, but it was in that test.

In short…by early next week (I hope), Tika trunk will be good to go with PDFBox 1.8.8.

If you’d like the one or two lines of code to change to get a Tika to build with 1.8.8-SNAPSHOT, let me know.

Best,

           Tim

From: Peter Bowyer [mailto:peter@mapledesign.co.uk]
Sent: Thursday, December 11, 2014 12:43 PM
To: user@tika.apache.org
Subject: Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs



PDF Testcases pass, but fail on my own encrypted PDF (sample file at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf. Its password is 'testing123')

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6 @updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more




I searched the pdfbox issue tracker and found https://issues.apache.org/jira/browse/PDFBOX-2469 and https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to related issues. The ticket status says a number of these issues are fixed in the 1.8.8 snapshot, and if you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set <pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edit `tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties` and enable the non-sequential parser.

Now tika won't build. I change PDFParser.properties back and it won't build either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content


System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter