Posted to user@tika.apache.org by Mark Kerzner <ma...@gmail.com> on 2009/06/04 21:37:32 UTC

Re: Testing Tika

Uwe, can you tell me what the status of this is?

On Thu, Apr 2, 2009 at 4:42 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

>  Hi Mark,
>
>
>
> with the attached patch, TIKA can decode encrypted PDFs (I tested it with
> your example). I think this patch should be applied to TIKA and all other
> projects using this empty password hack with newer PDFBox versions. Can you
> create a JIRA issue so that I can attach the patch to it? I would like to add a
> test case to it, too. Can we use your example PDF for the test case?
>
>
>
> Uwe
>
>
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>   ------------------------------
>
> *From:* Mark Kerzner [mailto:markkerzner@gmail.com]
> *Sent:* Thursday, April 02, 2009 9:31 PM
>
> *To:* tika-user@lucene.apache.org
> *Subject:* Re: Testing Tika
>
>
>
> If I want to process such protected PDFs, should I hack Tika somehow, or
> should I wait for a change? Should I fix it and commit the change?
>
>
>
> Thank you,
>
> Mark
>
> On Tue, Mar 31, 2009 at 1:20 AM, Uwe Schindler <uw...@thetaphi.de> wrote:
>
> Hi Jukka,
>
> the problem is mostly the following (I wrote a PDF text extractor in the
> past, too): after opening the document, the current code in TIKA checks
> whether the PDF is encrypted and, if so, tries to set an empty string as the
> password. With newer versions of PDFBox, this no longer seems to be
> needed: it can read encrypted documents that have a password for writing
> without any further checks. Just try removing this "set empty password"
> part from TIKA and it will work.
>
> I am not really sure why this bad "hack" is always used together with
> PDFBox (Nutch does this, and so do a lot of examples on the web). I was able
> to extract text from thousands of PDFs with just this short code snippet:
>
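>                Writer out=...
>                PDDocument pd = PDDocument.load(in);
>                try {
>                        // No encryption check and no decrypt call here.
>                        PDFTextStripper stripper = new PDFTextStripper();
>                        stripper.writeText(pd, out);
>                } finally {
>                        try {
>                                pd.close();
>                        } catch (IOException e) {
>                                // Ignore failures while closing the document.
>                        }
>                }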
>
> So it does not detect encryption or anything else, and it was able to extract
> text from encrypted documents. Only documents that have a real password set
> for reading failed (but this is intended).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
> > Sent: Tuesday, March 31, 2009 2:18 AM
> > To: tika-user@lucene.apache.org
> > Subject: Re: Testing Tika
> >
> > Hi,
> >
> > On Mon, Mar 30, 2009 at 7:20 PM, Mark Kerzner <ma...@gmail.com>
> > wrote:
> > > For the attached PDF file I get this exception
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> >
> > The underlying PDFBox exception seems to be:
> >
> > Error: The supplied password does not match either the owner or user
> > password in the document.
> >       at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
> >       at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> >
> > I'm able to open the document normally using Acrobat Reader, so I
> > guess PDFBox is simply getting confused about the document.
> >
> > > Neko gives a null pointer exception, and I see this bug submitted in Jira a
> > > few days ago.
> >
> > OK. I think Tika should be better prepared to handle even NPEs and
> > other RuntimeExceptions from parser libraries.
> >
> > BR,
> >
> > Jukka Zitting
>
>
>
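
Jukka's point in the quoted thread, that Tika should be better prepared for
NPEs and other RuntimeExceptions from parser libraries, could be sketched
roughly as follows. This is only an illustration under stated assumptions:
ExtractionException and GuardedExtractor are hypothetical names, not Tika
classes, and the "library" is modeled as a plain java.util.function.Function
rather than a real PDFBox or Neko call.

```java
import java.util.function.Function;

/** Hypothetical checked exception standing in for a framework-level failure. */
class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical wrapper; this is not Tika's actual API. */
final class GuardedExtractor {
    private GuardedExtractor() {}

    /**
     * Runs a third-party extraction routine and converts any unchecked
     * exception (NullPointerException included) into ExtractionException,
     * keeping the original exception as the cause.
     */
    static String extract(Function<byte[], String> library, byte[] input)
            throws ExtractionException {
        try {
            return library.apply(input);
        } catch (RuntimeException e) {
            throw new ExtractionException("Unable to extract content", e);
        }
    }

    public static void main(String[] args) {
        // Simulated library bug: the routine always throws an NPE.
        Function<byte[], String> buggy = data -> {
            throw new NullPointerException("simulated library bug");
        };
        try {
            extract(buggy, new byte[0]);
        } catch (ExtractionException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause() + ")");
        }
    }
}
```

With a wrapper like this, callers only need to handle one checked exception
type, and the original NPE stays available through getCause() for bug reports.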