Posted to user@tika.apache.org by Clemens Wyss DEV <cl...@mysign.ch> on 2015/05/08 17:33:02 UTC

extracting text from an "encrypted" pdf

When I try to extract text from an "encrypted" PDF (one that can nevertheless be read in Acrobat Reader) with:

pdfDocument = PDDocument.load( TIKA_FILES_DIR + "doc1.pdf" );
PDFTextStripper pdfStripper = new PDFTextStripper();
parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted" is logged.

When, on the other hand, I do:

ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
parser = new AutoDetectParser();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
parsedText = handler.toString();

I do get to see some text/content of that very PDF.

1) What is the preferred way to extract text from a PDF ("-that-can-be-read-in-AcrobatReader")?
2) Does the second approach possibly return more than text? Blobs? Binary data?

RE: extracting text from an "encrypted" pdf

Posted by "Allison, Timothy B." <ta...@mitre.org>.

PDF encryption and access permissions are tricky (see, e.g., the discussion and links here: https://issues.apache.org/jira/browse/TIKA-1489 ). There are potentially two passwords for a PDF document: the owner password and the user password. Often, the user password is set to the empty string. This allows the owner to modify the document while effectively giving "read" access to the user.

Aside from encryption, but related to it, a PDF file has various AccessPermissions. Among other permissions, an owner can specify whether or not text may be extracted and/or whether or not text may be extracted for accessibility. As of Tika 1.8, you can have Tika respect these permissions by sending in an AccessChecker via the ParseContext.
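
To make the two extraction permissions concrete, here is a minimal sketch in plain Java (no Tika or PDFBox calls) of how the extraction-related bits of a PDF's permission value could be read. The class and method names are hypothetical and exist only for illustration. The bit positions (5 for content extraction, 10 for extraction for accessibility) follow the PDF specification's standard security handler, and the allowExtractionForAccessibility flag is just an assumption about how a permission-respecting checker might be configured, not a description of Tika's internals.

```java
/**
 * Sketch only: decodes the extraction-related bits of a PDF permission
 * value (the /P entry of the encryption dictionary). All names here are
 * hypothetical; none of this is Tika or PDFBox API.
 */
final class PdfPermissionSketch {

    // The PDF specification numbers bits starting at 1.
    static final int EXTRACT_CONTENT = 1 << 4;           // bit 5
    static final int EXTRACT_FOR_ACCESSIBILITY = 1 << 9; // bit 10

    static boolean canExtractContent(int permissions) {
        return (permissions & EXTRACT_CONTENT) != 0;
    }

    static boolean canExtractForAccessibility(int permissions) {
        return (permissions & EXTRACT_FOR_ACCESSIBILITY) != 0;
    }

    /**
     * Decision a permission-respecting extractor might make (assumption):
     * general extraction always suffices; accessibility-only extraction
     * counts only if the caller has opted in.
     */
    static boolean mayExtractText(int permissions, boolean allowExtractionForAccessibility) {
        if (canExtractContent(permissions)) {
            return true;
        }
        return allowExtractionForAccessibility && canExtractForAccessibility(permissions);
    }

    public static void main(String[] args) {
        int allAllowed = 0xFFFFFFFC;                       // every permission bit set, bits 1-2 reserved as 0
        int noCopy = allAllowed & ~EXTRACT_CONTENT;        // owner forbids general extraction
        int nothing = noCopy & ~EXTRACT_FOR_ACCESSIBILITY; // owner forbids both kinds

        System.out.println("all allowed, strict:      " + mayExtractText(allAllowed, false));
        System.out.println("no copy, strict:          " + mayExtractText(noCopy, false));
        System.out.println("no copy, accessibility:   " + mayExtractText(noCopy, true));
        System.out.println("nothing, accessibility:   " + mayExtractText(nothing, true));
    }
}
```

The point of the sketch is only that "can be read in Acrobat Reader" and "text may be extracted" are separate questions: a document can open with an empty user password and still carry a permission value that forbids extraction.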

1) What is the preferred way to extract text from a PDF ("-that-can-be-read-in-AcrobatReader")?

If you only want text from the PDF document itself (not attachments/embedded documents) and you are only parsing PDFs, then it might make sense to use pure PDFBox. <unconfirmed> I haven't checked recently, but I _think_ that Tika may be pulling out some text from annotations or maybe AcroFields that PDFTextStripper isn't </unconfirmed>. I can look into this if it matters to you. Tika also extracts normalized metadata and does a bit more with metadata than you would get from PDFTextStripper alone.

2) Does the second approach possibly return more than text? Blobs? Binary data?
The second approach will leverage the full power of Tika to extract content from embedded documents/attachments. The first approach will only extract text from the outer PDF document. You can extract binary data (embedded images or other embedded files) in Tika by setting an EmbeddedDocumentExtractor in the ParseContext instead of the Parser.class.