Posted to users@pdfbox.apache.org by Clemens Wyss DEV <cl...@mysign.ch> on 2015/05/08 17:36:24 UTC

extracting text from an "encrypted" pdf

When I try to extract text from an "encrypted" document (one that can be read in Acrobat Reader) with:

pdfDocument = PDDocument.load( is );
PDFTextStripper pdfStripper = new PDFTextStripper(); 
parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted" is logged.

When, on the other hand, I do:

ContentHandler handler = new BodyContentHandler( -1 ); 
ParseContext context = new ParseContext(); 
parser = new AutoDetectParser(); 
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
parsedText = handler.toString();

I do get the text content of that same PDF.

1) What is the preferred way to extract text from a pdf ("-that-can-be-read-in-AcrobatReader")?
2) Does the second approach possibly return "more than text"? Blobs? Binary data?

Re: extracting text from an "encrypted" pdf

Posted by John Hewson <jo...@jahewson.com>.

Great!

— John

> On 8 May 2015, at 14:49, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 08.05.2015 at 23:47, John Hewson wrote:
>> Can’t we make PDFBox open the document with an empty password? What’s the story for 2.0?
> 
> In 2.0 it opens immediately. The same applies in 1.8 when using loadNonSeq().
> 
> Tilman

Re: extracting text from an "encrypted" pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.05.2015 at 23:47, John Hewson wrote:
> Can’t we make PDFBox open the document with an empty password? What’s the story for 2.0?

In 2.0 it opens immediately. The same applies in 1.8 when using loadNonSeq().

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: extracting text from an "encrypted" pdf

Posted by John Hewson <jo...@jahewson.com>.

Can’t we make PDFBox open the document with an empty password? What’s the story for 2.0?

— John

> On 8 May 2015, at 08:52, Tilman Hausherr <TH...@t-online.de> wrote:
> 
> On 08.05.2015 at 17:51, Clemens Wyss DEV wrote:
>> Thx for the very fast answer.
>>> new StandardDecryptionMaterial( password );
>> I have no password. The pdf is a public user manual.
> 
> Use an empty password :-)
> 
> Tilman

Re: extracting text from an "encrypted" pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.05.2015 at 17:51, Clemens Wyss DEV wrote:
> Thx for the very fast answer.
>> new StandardDecryptionMaterial( password );
> I have no password. The pdf is a public user manual.

Use an empty password :-)

Tilman
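
Some background on why an empty password can work here, offered as a hedged aside: a PDF counts as encrypted when its trailer carries an /Encrypt entry, and with the standard security handler the user password may be empty, which is presumably why viewers open this manual without prompting. The sketch below is a rough heuristic using only the Java standard library: it scans the raw bytes for the /Encrypt name. The class and method names (EncryptMarkerCheck, declaresEncryption) are hypothetical and not part of PDFBox or Tika, which parse the trailer properly instead of searching bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
public class EncryptMarkerCheck {

    // Heuristic: true if the name "/Encrypt" occurs anywhere in the raw bytes.
    // Assumption: the trailer (or cross-reference stream dictionary) is stored
    // as plain text, so the key is visible without decoding any stream.
    // False positives are possible if the same characters occur elsewhere.
    public static boolean declaresEncryption(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data survives.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java EncryptMarkerCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(declaresEncryption(bytes)
                ? "file declares /Encrypt"
                : "no /Encrypt entry found");
    }
}
```

If this reports an /Encrypt entry for a file that nevertheless opens without a prompt, passing an empty string to StandardDecryptionMaterial, as suggested above, is the first thing to try.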



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: extracting text from an "encrypted" pdf

Posted by Clemens Wyss DEV <cl...@mysign.ch>.

Thx for the very fast answer. 
> new StandardDecryptionMaterial( password );
I have no password. The pdf is a public user manual.

> That is TIKA, isn't it?
True



Re: extracting text from an "encrypted" pdf

Posted by Tilman Hausherr <TH...@t-online.de>.

On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
> When I try to extract an "encrypted" (which can be read in AcrobatReader) document with:
>
> pdfDocument = PDDocument.load( is );

add
if( document.isEncrypted() )
{
    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
    document.openProtection( sdm );
}

or use loadNonSeq()

> PDFTextStripper pdfStripper = new PDFTextStripper();
> parsedText = pdfStripper.getText( pdfDocument );
>
> I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser - Document is encrypted" is logged.
>
> When, on the other hand, I do:
>
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> parser = new AutoDetectParser();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> parsedText = handler.toString();
>
> I get to see the text/content of the very pdf.
>
> 1) What is the preferred way to extract text from a pdf ("-that-can-be-read-in-AcrobatReader")?
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date

>   
> 2) Does the second approach possibly return "more than text"? Blobs? Binary data?

That is TIKA, isn't it?

Tilman
