Posted to users@pdfbox.apache.org by Ian Rogers <ia...@gmail.com> on 2016/04/02 17:35:41 UTC

Parsing read-only PDFs

Hi,

I am using PDFBox 1.8.2 because I installed an available NuGet package for
.NET.

My question is this. I am reading in the PDF files with the following
commands:
            PDDocument pdDoc = PDDocument.load(path_to_file);
            java.util.List allPages = pdDoc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage)allPages.get(0);
            PDStream contents = firstPage.getContents();
            COSStream content = contents.getStream();
            Debug.WriteLine(content.getStreamTokens());

This works great until the PDF has password security that disallows
modifying the contents but still allows freely reading and copying them.
In that case I get an IOException with the following stack trace:

   at org.apache.pdfbox.cos.COSStream.doDecode(COSName , Int32 )
   at org.apache.pdfbox.cos.COSStream.doDecode()
   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream()
   at org.apache.pdfbox.pdfparser.PDFStreamParser..ctor(COSStream stream)
   at org.apache.pdfbox.cos.COSStream.getStreamTokens()

I also tried the PDFTextStripper utility, and it parses PDFs fine both with
and without the above-mentioned password security. I looked through the
1.8.10 source to compare it with what I am doing, but can't see where I am
going wrong.

Any help or pointers would be much appreciated.

Thanks,
Ian

Re: Parsing read-only PDFs

Posted by Tilman Hausherr <TH...@t-online.de>.

On 02.04.2016 at 17:35, Ian Rogers wrote:
> Hi,
>
> I am using PDFBox 1.8.2 because I installed an available NuGet package for
> .Net
>
> My question is this. I am reading in the PDF files with the following
> commands:
>              PDDocument pdDoc = PDDocument.load(path_to_file);

Use openProtection() on the document after loading it. See:

http://stackoverflow.com/a/29676278/535646

Tilman
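
For readers following the thread, here is a minimal Java sketch of what the
openProtection() suggestion could look like with the PDFBox 1.8.x API. It
assumes the file is encrypted with an empty user password (the usual case for
PDFs that open without a prompt but restrict modification); a file with a
real user password would need that password instead. The class name
ReadProtectedPdf and the helper firstPageTokens are hypothetical names made
up for this illustration. From .NET via the NuGet build, the same calls
should apply with C# syntax.

```java
import java.util.List;

import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;

public class ReadProtectedPdf {

    // Hypothetical helper: returns the parsed tokens of the first page's
    // content stream, decrypting the document first if necessary.
    static List<?> firstPageTokens(String path) throws Exception {
        PDDocument doc = PDDocument.load(path);
        try {
            if (doc.isEncrypted()) {
                // Assumption: the user password is empty. Without this step
                // the streams are still encrypted and decoding them fails.
                doc.openProtection(new StandardDecryptionMaterial(""));
            }
            List<?> allPages = doc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage) allPages.get(0);
            COSStream content = firstPage.getContents().getStream();
            return content.getStreamTokens();
        } finally {
            // Always release the document's resources.
            doc.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstPageTokens(args[0]));
    }
}
```

As far as I can tell, PDFTextStripper in 1.8.x makes a similar attempt to
decrypt with an empty password before extracting text, which would explain
why it handled the protected files while the direct stream access did not.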
