Posted to dev@tika.apache.org by Ehsan Sadeghi <es...@gmail.com> on 2010/06/04 11:51:07 UTC

PDF text extraction problems

Hello,
I sent this to the Tika Linked before and got the following answer from
Jukka Zitting:

It may be that the PDFBox library Tika uses for handling PDF documents is
having a problem with parsing your files. Do you have an example file that
you can share?

BR,


So here are the original mail and the attachments.
PDF file 1:
https://docs.google.com/fileview?id=0B2X-v8a_ekanYmMyMzg1NTktMmFlMi00YjU2LTk2OWQtMTg2NTI1YWI4NTZh&hl=en

PDF file 2:
https://docs.google.com/fileview?id=0B2X-v8a_ekanMTUyNjExMjUtMTI5Yy00NDc4LTg0YmYtODg4NmNkMGIxMmZk&hl=en


I'm trying to parse a PDF file. I first tried this code:

          // the document to be parsed
          InputStream input = new FileInputStream(new File(resourceLocation));
          ContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          PDFParser parser = new PDFParser();
          ParseContext context = new ParseContext();
          parser.parse(input, textHandler, metadata, context);
          input.close();

Then I tried the Tika class:

        Tika tika = new Tika();
        InputStream input = new FileInputStream(new File(resourceLocation));
        Metadata metadata = new Metadata();
        String content = tika.parseToString(input, metadata);

Both snippets behave exactly the same way: they extract some of the text in
the PDF file but leave the rest out. I tested this with a 1 MB file and a
100 KB file.
I looked around and found the message "Tika maxStringLength limit reached"
on the Tika mailing list, where it was suggested that one could raise
maxStringLength like this:
  tika.setMaxStringLength(10*1024*1024);

That made no difference. Am I doing something wrong? How can I parse the
entire file?

cheers
ehsan