Posted to user@tika.apache.org by Mark Kerzner <ma...@gmail.com> on 2009/04/02 21:30:42 UTC

Re: Testing Tika

If I want to process such protected PDFs, should I hack Tika somehow, or
should I wait for a change? Should I fix it and commit the change?
Thank you,
Mark

On Tue, Mar 31, 2009 at 1:20 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi Jukka,
>
> the problem is mostly the following (I wrote a PDF text extractor in the
> past, too): After opening the document, the current code in TIKA checks
> whether the PDF document is encrypted, and if so, it tries to set an empty
> string as password. With newer versions of PDFBox, it seems that this is no
> longer needed. It can read encrypted documents that have a password for
> writing without any further checks. Just try to remove this "set empty
> password" part from TIKA and it will work.
>
> I am not really sure why this bad "hack" is always used together with
> PDFBOX (Nutch does this, and so do a lot of examples from the web). I was
> able to extract text from thousands of PDFs with just this short code
> snippet:
>
>     Writer out = ...
>     // Load the document without checking for encryption or setting a password
>     PDDocument pd = PDDocument.load(in);
>     try {
>         PDFTextStripper stripper = new PDFTextStripper();
>         stripper.writeText(pd, out);
>     } finally {
>         try {
>             pd.close();
>         } catch (IOException e) {
>             // ignore failures while closing
>         }
>     }
>
> So it does not detect encryption or anything else, and it was able to extract
> text from encrypted documents. Only documents that have a real password set
> for reading failed (but this is intended).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> > Sent: Tuesday, March 31, 2009 2:18 AM
> > To: tika-user@lucene.apache.org
> > Subject: Re: Testing Tika
> >
> > Hi,
> >
> > On Mon, Mar 30, 2009 at 7:20 PM, Mark Kerzner <ma...@gmail.com>
> > wrote:
> > > For the attached PDF file I get this exception
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >
> > The underlying PDFBox exception seems to be:
> >
> > Error: The supplied password does not match either the owner or user
> > password in the document.
> >       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
> >       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> >
> > I'm able to open the document normally using Acrobat Reader, so I
> > guess PDFBox is simply getting confused about the document.
> >
> > > Neko gives a null pointer exception, and I see this bug submitted in
> > > Jira a few days ago.
> >
> > OK. I think Tika should be better prepared to handle even NPEs and
> > other RuntimeExceptions from parser libraries.
> >
> > BR,
> >
> > Jukka Zitting
>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.

Uwe,

do you still want me to create a JIRA issue for this?

Thank you,
Mark

On Fri, Apr 3, 2009 at 11:24 AM, Mark Kerzner <ma...@gmail.com> wrote:

> Uwe,
> here is Shakespeare's Midsummer Night's Dream, obviously in the public
> domain, converted to a PDF that gives you the same problem on
> extraction.
>
> I hope that's good enough; if not, please tell me.
>
> Thank you,
> Mark
>
>
> On Thu, Apr 2, 2009 at 6:32 PM, Mark Kerzner <ma...@gmail.com> wrote:
>
>> I will try to generate one tomorrow. I have a complete PDF program, and I
>> think the whole problem is that it should be "Restricted"
>>
>> Mark <http://shmsoft.com/aboutmark.html>
>>
>>
>> On Thu, Apr 2, 2009 at 5:32 PM, Jukka Zitting <ju...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> On Thu, Apr 2, 2009 at 11:48 PM, Mark Kerzner <ma...@gmail.com>
>>> wrote:
>>> > Uwe, you can use this PDF, because I found it on the web, so it's
>>> > public.
>>>
>>> Unfortunately that's not enough for us.
>>>
>>> The document being public makes it OK for you or me to download and
>>> view it (and try PDFBox on it), but that's not enough for Tika as an
>>> open source project to include and redistribute the document. We need
>>> an ALv2-compatible license for everything we include in Tika.
>>>
>>> It would be ideal if someone could generate a simplified PDF document
>>> that illustrates the same problem. Alternatively the best we can do is
>>> just to attach the current example document in the Jira issue (with
>>> the "Attachment not intended for inclusion" option selected) so
>>> everyone who is looking at the change can at least manually download
>>> the document and try it for themselves.
>>>
>>> BR,
>>>
>>> Jukka Zitting
>>>
>>
>>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.

Uwe,
here is Shakespeare's Midsummer Night's Dream, obviously in the public
domain, converted to a PDF that gives you the same problem on
extraction.

I hope that's good enough; if not, please tell me.

Thank you,
Mark

On Thu, Apr 2, 2009 at 6:32 PM, Mark Kerzner <ma...@gmail.com> wrote:

> I will try to generate one tomorrow. I have a complete PDF program, and I
> think the whole problem is that it should be "Restricted"
>
> Mark <http://shmsoft.com/aboutmark.html>
>
>
> On Thu, Apr 2, 2009 at 5:32 PM, Jukka Zitting <ju...@gmail.com> wrote:
>
>> Hi,
>>
>> On Thu, Apr 2, 2009 at 11:48 PM, Mark Kerzner <ma...@gmail.com>
>> wrote:
>> > Uwe, you can use this PDF, because I found it on the web, so it's
>> > public.
>>
>> Unfortunately that's not enough for us.
>>
>> The document being public makes it OK for you or me to download and
>> view it (and try PDFBox on it), but that's not enough for Tika as an
>> open source project to include and redistribute the document. We need
>> an ALv2-compatible license for everything we include in Tika.
>>
>> It would be ideal if someone could generate a simplified PDF document
>> that illustrates the same problem. Alternatively the best we can do is
>> just to attach the current example document in the Jira issue (with
>> the "Attachment not intended for inclusion" option selected) so
>> everyone who is looking at the change can at least manually download
>> the document and try it for themselves.
>>
>> BR,
>>
>> Jukka Zitting
>>
>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.

I will try to generate one tomorrow. I have a complete PDF program, and I
think the whole problem is that it should be "Restricted"

Mark <http://shmsoft.com/aboutmark.html>

On Thu, Apr 2, 2009 at 5:32 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Thu, Apr 2, 2009 at 11:48 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > Uwe, you can use this PDF, because I found it on the web, so it's public.
>
> Unfortunately that's not enough for us.
>
> The document being public makes it OK for you or me to download and
> view it (and try PDFBox on it), but that's not enough for Tika as an
> open source project to include and redistribute the document. We need
> an ALv2-compatible license for everything we include in Tika.
>
> It would be ideal if someone could generate a simplified PDF document
> that illustrates the same problem. Alternatively the best we can do is
> just to attach the current example document in the Jira issue (with
> the "Attachment not intended for inclusion" option selected) so
> everyone who is looking at the change can at least manually download
> the document and try it for themselves.
>
> BR,
>
> Jukka Zitting
>

Re: Testing Tika

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Apr 2, 2009 at 11:48 PM, Mark Kerzner <ma...@gmail.com> wrote:
> Uwe, you can use this PDF, because I found it on the web, so it's public.

Unfortunately that's not enough for us.

The document being public makes it OK for you or me to download and
view it (and try PDFBox on it), but that's not enough for Tika as an
open source project to include and redistribute the document. We need
an ALv2-compatible license for everything we include in Tika.

It would be ideal if someone could generate a simplified PDF document
that illustrates the same problem. Alternatively the best we can do is
just to attach the current example document in the Jira issue (with
the "Attachment not intended for inclusion" option selected) so
everyone who is looking at the change can at least manually download
the document and try it for themselves.

BR,

Jukka Zitting

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.

Uwe, you can use this PDF, because I found it on the web, so it's public.
I will create a JIRA issue, but tomorrow, since I am running away (to a
concert :). If you can wait, fine; if not, you can create your own.

Thank you,
Mark

On Thu, Apr 2, 2009 at 4:42 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  Hi Mark,
>
>
>
> with the attached patch, TIKA can decode encrypted PDFs (I tested it with
> your example). I think this patch should be applied to TIKA and all other
> projects using this empty password hack with newer PDFBox versions. Can you
> create a JIRA issue so that I can attach the patch afterwards? I would like
> to add a test case to it, too. Can we use your example PDF for the test case?
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Mark Kerzner [mailto:markkerzner@gmail.com]
> *Sent:* Thursday, April 02, 2009 9:31 PM
>
> *To:* tika-user@lucene.apache.org
> *Subject:* Re: Testing Tika
>
>
>
> If I want to process such protected PDFs, should I hack Tika somehow, or
> should I wait for a change? Should I fix it and commit the change?
>
>
>
> Thank you,
>
> Mark
>
> On Tue, Mar 31, 2009 at 1:20 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
> Hi Jukka,
>
> the problem is mostly the following (I wrote a PDF text extractor in the
> past, too): After opening the document, the current code in TIKA checks
> whether the PDF document is encrypted, and if so, it tries to set an empty
> string as password. With newer versions of PDFBox, it seems that this is no
> longer needed. It can read encrypted documents that have a password for
> writing without any further checks. Just try to remove this "set empty
> password" part from TIKA and it will work.
>
> I am not really sure why this bad "hack" is always used together with
> PDFBOX (Nutch does this, and so do a lot of examples from the web). I was
> able to extract text from thousands of PDFs with just this short code
> snippet:
>
>     Writer out = ...
>     // Load the document without checking for encryption or setting a password
>     PDDocument pd = PDDocument.load(in);
>     try {
>         PDFTextStripper stripper = new PDFTextStripper();
>         stripper.writeText(pd, out);
>     } finally {
>         try {
>             pd.close();
>         } catch (IOException e) {
>             // ignore failures while closing
>         }
>     }
>
> So it does not detect encryption or anything else, and it was able to extract
> text from encrypted documents. Only documents that have a real password set
> for reading failed (but this is intended).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> > Sent: Tuesday, March 31, 2009 2:18 AM
> > To: tika-user@lucene.apache.org
> > Subject: Re: Testing Tika
> >
> > Hi,
> >
> > On Mon, Mar 30, 2009 at 7:20 PM, Mark Kerzner <ma...@gmail.com>
> > wrote:
> > > For the attached PDF file I get this exception
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >
> > The underlying PDFBox exception seems to be:
> >
> > Error: The supplied password does not match either the owner or user
> > password in the document.
> >       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
> >       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> >
> > I'm able to open the document normally using Acrobat Reader, so I
> > guess PDFBox is simply getting confused about the document.
> >
> > > Neko gives a null pointer exception, and I see this bug submitted in
> > > Jira a few days ago.
> >
> > OK. I think Tika should be better prepared to handle even NPEs and
> > other RuntimeExceptions from parser libraries.
> >
> > BR,
> >
> > Jukka Zitting
>
>
>

Re: Testing Tika

Posted by Mark Kerzner <ma...@gmail.com>.

Uwe, can you tell me what the status of this is?

On Thu, Apr 2, 2009 at 4:42 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  Hi Mark,
>
>
>
> with the attached patch, TIKA can decode encrypted PDFs (I tested it with
> your example). I think this patch should be applied to TIKA and all other
> projects using this empty password hack with newer PDFBox versions. Can you
> create a JIRA issue so that I can attach the patch afterwards? I would like
> to add a test case to it, too. Can we use your example PDF for the test case?
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Mark Kerzner [mailto:markkerzner@gmail.com]
> *Sent:* Thursday, April 02, 2009 9:31 PM
>
> *To:* tika-user@lucene.apache.org
> *Subject:* Re: Testing Tika
>
>
>
> If I want to process such protected PDFs, should I hack Tika somehow, or
> should I wait for a change? Should I fix it and commit the change?
>
>
>
> Thank you,
>
> Mark
>
> On Tue, Mar 31, 2009 at 1:20 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
> Hi Jukka,
>
> the problem is mostly the following (I wrote a PDF text extractor in the
> past, too): After opening the document, the current code in TIKA checks
> whether the PDF document is encrypted, and if so, it tries to set an empty
> string as password. With newer versions of PDFBox, it seems that this is no
> longer needed. It can read encrypted documents that have a password for
> writing without any further checks. Just try to remove this "set empty
> password" part from TIKA and it will work.
>
> I am not really sure why this bad "hack" is always used together with
> PDFBOX (Nutch does this, and so do a lot of examples from the web). I was
> able to extract text from thousands of PDFs with just this short code
> snippet:
>
>     Writer out = ...
>     // Load the document without checking for encryption or setting a password
>     PDDocument pd = PDDocument.load(in);
>     try {
>         PDFTextStripper stripper = new PDFTextStripper();
>         stripper.writeText(pd, out);
>     } finally {
>         try {
>             pd.close();
>         } catch (IOException e) {
>             // ignore failures while closing
>         }
>     }
>
> So it does not detect encryption or anything else, and it was able to extract
> text from encrypted documents. Only documents that have a real password set
> for reading failed (but this is intended).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> > Sent: Tuesday, March 31, 2009 2:18 AM
> > To: tika-user@lucene.apache.org
> > Subject: Re: Testing Tika
> >
> > Hi,
> >
> > On Mon, Mar 30, 2009 at 7:20 PM, Mark Kerzner <ma...@gmail.com>
> > wrote:
> > > For the attached PDF file I get this exception
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >
> > The underlying PDFBox exception seems to be:
> >
> > Error: The supplied password does not match either the owner or user
> > password in the document.
> >       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
> >       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> >
> > I'm able to open the document normally using Acrobat Reader, so I
> > guess PDFBox is simply getting confused about the document.
> >
> > > Neko gives a null pointer exception, and I see this bug submitted in
> > > Jira a few days ago.
> >
> > OK. I think Tika should be better prepared to handle even NPEs and
> > other RuntimeExceptions from parser libraries.
> >
> > BR,
> >
> > Jukka Zitting
>
>
>

RE: Testing Tika

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Mark,

with the attached patch, TIKA can decode encrypted PDFs (I tested it with
your example). I think this patch should be applied to TIKA and all other
projects using this empty password hack with newer PDFBox versions. Can you
create a JIRA issue so that I can attach the patch afterwards? I would like
to add a test case to it, too. Can we use your example PDF for the test case?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

  _____  

From: Mark Kerzner [mailto:markkerzner@gmail.com] 
Sent: Thursday, April 02, 2009 9:31 PM
To: tika-user@lucene.apache.org
Subject: Re: Testing Tika

If I want to process such protected PDFs, should I hack Tika somehow, or
should I wait for a change? Should I fix it and commit the change?

Thank you,

Mark

On Tue, Mar 31, 2009 at 1:20 AM, Uwe Schindler <uw...@thetaphi.de> wrote:

Hi Jukka,

the problem is mostly the following (I wrote a PDF text extractor in the
past, too): After opening the document, the current code in TIKA checks
whether the PDF document is encrypted, and if so, it tries to set an empty
string as password. With newer versions of PDFBox, it seems that this is no
longer needed. It can read encrypted documents that have a password for
writing without any further checks. Just try to remove this "set empty
password" part from TIKA and it will work.

I am not really sure why this bad "hack" is always used together with
PDFBOX (Nutch does this, and so do a lot of examples from the web). I was
able to extract text from thousands of PDFs with just this short code
snippet:

    Writer out = ...
    // Load the document without checking for encryption or setting a password
    PDDocument pd = PDDocument.load(in);
    try {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.writeText(pd, out);
    } finally {
        try {
            pd.close();
        } catch (IOException e) {
            // ignore failures while closing
        }
    }

So it does not detect encryption or anything else, and it was able to extract
text from encrypted documents. Only documents that have a real password set
for reading failed (but this is intended).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> Sent: Tuesday, March 31, 2009 2:18 AM
> To: tika-user@lucene.apache.org
> Subject: Re: Testing Tika
>
> Hi,
>
> On Mon, Mar 30, 2009 at 7:20 PM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > For the attached PDF file I get this exception
> > org.apache.tika.exception.TikaException: Unable to extract PDF content
>
> The underlying PDFBox exception seems to be:
>
> Error: The supplied password does not match either the owner or user
> password in the document.
>       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
>       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
>
> I'm able to open the document normally using Acrobat Reader, so I
> guess PDFBox is simply getting confused about the document.
>
> > Neko gives a null pointer exception, and I see this bug submitted in
> > Jira a few days ago.
>
> OK. I think Tika should be better prepared to handle even NPEs and
> other RuntimeExceptions from parser libraries.
>
> BR,
>
> Jukka Zitting
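
Jukka's remark that Tika should be better prepared for NPEs and other
RuntimeExceptions from parser libraries could be sketched roughly as below.
This is only an illustration under stated assumptions: SafeParse, TextParser
and ParseFailedException are hypothetical names invented for this sketch and
are not Tika or PDFBox APIs. The idea is simply to catch unchecked exceptions
at the boundary to the third-party parser and rethrow them as one checked
exception that keeps the original cause.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SafeParse {

    /** Hypothetical stand-in for a third-party parser library; not a Tika interface. */
    interface TextParser {
        String parse(InputStream in) throws IOException;
    }

    /** Hypothetical checked exception reported to callers when a parser misbehaves. */
    static class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the parser and converts unchecked failures (NPEs etc.) into a checked exception. */
    static String parseSafely(TextParser parser, InputStream in)
            throws IOException, ParseFailedException {
        try {
            return parser.parse(in);
        } catch (RuntimeException e) {
            // Keep the original exception as the cause so its stack trace is not lost.
            throw new ParseFailedException("Unexpected failure in parser library", e);
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(new byte[0]);
        // Simulated parser bug: an unchecked exception escapes from the library.
        TextParser broken = stream -> { throw new NullPointerException("simulated parser bug"); };
        try {
            parseSafely(broken, in);
        } catch (ParseFailedException e) {
            System.out.println("Caught: " + e.getMessage() + " (cause: " + e.getCause() + ")");
        }
    }
}
```

With a wrapper like this, a caller only needs to handle IOException and the
one checked exception, and a bug inside a parser library does not take down
the whole extraction run.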