Posted to users@pdfbox.apache.org by Peter Murray-Rust <pm...@cam.ac.uk> on 2015/04/19 12:29:46 UTC

Problem reading PDF: encrypted document and unknown compression method

I am trying to extract text from
https://hal.archives-ouvertes.fr/pastel-00003992/document
using PDFTextStripper (pdfbox V1.8.8)

I can visually read this (263 pages) on AdobeReader on MacOSX, but PDFBox
gives the following output.

495  [main] INFO  org.apache.pdfbox.pdfparser.PDFParser  - Document is encrypted
656  [main] ERROR org.apache.pdfbox.filter.FlateFilter  - FlateFilter: stop reading corrupt stream due to a DataFormatException
[8 repeats snipped]
656  [main] ERROR org.apache.pdfbox.filter.FlateFilter  - FlateFilter: stop reading corrupt stream due to a DataFormatException
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.args.DefaultArgProcessor.runRunMethod(DefaultArgProcessor.java:597)
    at org.xmlcml.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:463)
    at org.xmlcml.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:663)
    at org.xmlcml.norma.Norma.run(Norma.java:28)
    at org.xmlcml.norma.Prototypes.runHalThesis1(Prototypes.java:11)
    at org.xmlcml.norma.Prototypes.main(Prototypes.java:6)
Caused by: java.lang.RuntimeException: Cannot transform PDF examples/theses/HalThesis1/fulltext.pdf
    at org.xmlcml.norma.NormaTransformer.applyPDF2TXTToCMLDir(NormaTransformer.java:87)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:69)
    at org.xmlcml.norma.NormaArgProcessor.transform(NormaArgProcessor.java:172)
    ... 10 more
Caused by: java.io.IOException
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225)
    at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:197)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:117)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
    at org.xmlcml.norma.input.pdf.PDF2TXTConverter.readPDF(PDF2TXTConverter.java:19)
    at org.xmlcml.norma.NormaTransformer.applyPDF2TXTToCMLDir(NormaTransformer.java:85)
    ... 12 more
Caused by: java.util.zip.DataFormatException: unknown compression method
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:128)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:101)
    ... 26 more
Exception in thread "main" java.lang.RuntimeException: cannot process argument: --xsl (DataFormatException: unknown compression method)
    at org.xmlcml.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:466)
    at org.xmlcml.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:663)
    at org.xmlcml.norma.Norma.run(Norma.java:28)
    at org.xmlcml.norma.Prototypes.runHalThesis1(Prototypes.java:11)
    at org.xmlcml.norma.Prototypes.main(Prototypes.java:6)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.args.DefaultArgProcessor.runRunMethod(DefaultArgProcessor.java:597)
    at org.xmlcml.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:463)
    ... 4 more
Caused by: java.lang.RuntimeException: Cannot transform PDF examples/theses/HalThesis1/fulltext.pdf
    at org.xmlcml.norma.NormaTransformer.applyPDF2TXTToCMLDir(NormaTransformer.java:87)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:69)
    at org.xmlcml.norma.NormaArgProcessor.transform(NormaArgProcessor.java:172)
    ... 10 more
Caused by: java.io.IOException
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:109)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:379)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:291)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:225)
    at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:197)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:117)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
    at org.xmlcml.norma.input.pdf.PDF2TXTConverter.readPDF(PDF2TXTConverter.java:19)
    at org.xmlcml.norma.NormaTransformer.applyPDF2TXTToCMLDir(NormaTransformer.java:85)
    ... 12 more
Caused by: java.util.zip.DataFormatException: unknown compression method
    at java.util.zip.Inflater.inflateBytes(Native Method)
    at java.util.zip.Inflater.inflate(Inflater.java:259)
    at java.util.zip.Inflater.inflate(Inflater.java:280)
    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:128)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:101)
    ... 26 more

Is this an encryption problem, a broken PDF that Adobe can somehow read,
or some other problem?

Many thanks


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Re: Problem reading PDF: encrypted document and unknown compression method

Posted by Tilman Hausherr <TH...@t-online.de>.

On 19.04.2015 at 13:17, Peter Murray-Rust wrote:
> Thank you
>
>
> On Sun, Apr 19, 2015 at 12:03 PM, Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>
> On 19.04.2015 at 12:29, Peter Murray-Rust wrote:
>> Did you decrypt the file? Did you either load the file with loadNonSeq(),
>> or with load() and then call openProtection()?
>>
> I thought I had used  loadNonSeq() but I will check. This is probably the
> problem.
>
>> Current version is 1.8.9
>>
> Thanks
>
> Please could you explain what "encrypted" means in this sense? The document
> is readable by all and there is no need for a password. Is PDFBox (and
> AdobeReader) bypassing security, or is this not really security?

An encrypted PDF has a user password and an owner password. The user
password lets you view the file with some restrictions; the owner
password removes those restrictions. Your file has an empty user password.

And yes, this is not really security:
https://www.cs.cmu.edu/~dst/Adobe/Gallery/anon21jul01-pdf-encryption.txt

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Problem reading PDF: encrypted document and unknown compression method

Posted by Peter Murray-Rust <pm...@cam.ac.uk>.

Thank you

On Sun, Apr 19, 2015 at 12:03 PM, Tilman Hausherr <TH...@t-online.de>
wrote:

On 19.04.2015 at 12:29, Peter Murray-Rust wrote:
>
> Did you decrypt the file? Did you either load the file with loadNonSeq(),
> or with load() and then call openProtection()?
>

I thought I had used loadNonSeq(), but I will check. This is probably the
problem.

>
> Current version is 1.8.9
>

Thanks

Please could you explain what "encrypted" means in this sense? The document
is readable by all and there is no need for a password. Is PDFBox (and
AdobeReader) bypassing security, or is this not really security?


Re: Problem reading PDF: encrypted document and unknown compression method

Posted by Tilman Hausherr <TH...@t-online.de>.

On 19.04.2015 at 12:29, Peter Murray-Rust wrote:
> I am trying to extract text from
> https://hal.archives-ouvertes.fr/pastel-00003992/document
> using PDFTextStripper (pdfbox V1.8.8)
>
> I can visually read this (263 pages) on AdobeReader on MacOSX, but PDFBox
> gives the following output.
>
> 495  [main] INFO  org.apache.pdfbox.pdfparser.PDFParser  - Document is
> encrypted


Did you decrypt the file? Did you either load the file with 
loadNonSeq(), or with load() and then call openProtection()?
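
A minimal sketch of why an undecrypted stream might surface as a Flate
error rather than an encryption error. It uses only java.util.zip, not
PDFBox. The class InflateDemo and its methods are hypothetical names, and
the XOR scramble is only a stand-in for real PDF encryption. The assumption
is that a Flate-compressed stream whose bytes are still encrypted looks
like garbage to the Inflater, which then throws a DataFormatException
much like the one in the trace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class InflateDemo {

    // Compress bytes with zlib/Deflate, as a FlateDecode stream would be.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // ample for short inputs
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Stand-in for encryption (NOT the PDF algorithm): XOR each byte.
    static byte[] scramble(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ 0x5A);
        }
        return out;
    }

    // Try to inflate; report either the decoded text or the exception.
    static String tryInflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[1024];
        try {
            int n = inflater.inflate(buf);
            return "ok: " + new String(buf, 0, n, StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            return "DataFormatException: " + e.getMessage();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(content);
        // Decoding the compressed bytes directly works.
        System.out.println(tryInflate(compressed));
        // Decoding bytes that are still "encrypted" fails in the Inflater.
        System.out.println(tryInflate(scramble(compressed)));
    }
}
```

The exact exception message depends on which zlib header check fails
first, so it may differ from the "unknown compression method" text above.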

Current version is 1.8.9

Tilman
