Posted to dev@tika.apache.org by Ehsan Sadeghi <es...@gmail.com> on 2010/06/04 11:51:07 UTC
PDF text extraction problems
Hello,
I sent this to the Tika Linked before and got this answer from Jukka
Zitting:
> It may be that the PDFBox library Tika uses for handling PDF documents is
> having a problem with parsing your files. Do you have an example file that
> you can share?
> BR,
So here is the original mail and the attachments.
PDF file 1:
https://docs.google.com/fileview?id=0B2X-v8a_ekanYmMyMzg1NTktMmFlMi00YjU2LTk2OWQtMTg2NTI1YWI4NTZh&hl=en
PDF file 2:
https://docs.google.com/fileview?id=0B2X-v8a_ekanMTUyNjExMjUtMTI5Yy00NDc4LTg0YmYtODg4NmNkMGIxMmZk&hl=en
I'm trying to parse a PDF file. I first tried this code:
// the document to be parsed
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
ParseContext context = new ParseContext();
parser.parse(input, textHandler, metadata, context);
input.close();
Then I tried the Tika class:
Tika tika = new Tika();
InputStream input = new FileInputStream(new File(resourceLocation));
Metadata metadata = new Metadata();
String content = tika.parseToString(input, metadata);
Both snippets behave the same way: they read some of the text in the PDF
file but leave the rest out. I tested this with a 1 MB file and a 100 KB
file.
I looked around and found this message in the Tika mails, "Tika
maxStringLength limit reached", where it was suggested that one could raise
maxStringLength by doing this:
tika.setMaxStringLength(10 * 1024 * 1024);
That made no difference. Am I doing something wrong? How can I parse the
entire file?
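One way to check whether a length cap is what cuts the text off is sketched below. It uses only the Java standard library, and the names TruncationCheck, capAt and looksTruncated are hypothetical helpers, not part of Tika. The 100,000-character cap is an assumed value for illustration and is not confirmed anywhere in this thread. The idea is that text whose length lands exactly on the cap was probably truncated there, while shorter text was likely cut off for some other reason.

```java
public class TruncationCheck {
    // Assumed cap in characters; a hypothetical value for this sketch only.
    public static final int ASSUMED_LIMIT = 100 * 1000;

    /** Mimics a character cap: returns at most `limit` characters; a negative limit means "no cap". */
    public static String capAt(String text, int limit) {
        if (limit < 0 || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }

    /** Returns true when the text is at least as long as the cap, which suggests it was cut off there. */
    public static boolean looksTruncated(String content, int limit) {
        return limit >= 0 && content.length() >= limit;
    }

    public static void main(String[] args) {
        // Stand-in for the full text of a document: 250,000 characters.
        String full = new String(new char[250000]).replace('\0', 'x');

        // With the assumed cap, the output stops at the cap.
        String capped = capAt(full, ASSUMED_LIMIT);
        System.out.println(capped.length() + " chars, truncated? "
                + looksTruncated(capped, ASSUMED_LIMIT));

        // With the larger limit from the mail above, the whole text fits.
        int raised = 10 * 1024 * 1024;
        String uncapped = capAt(full, raised);
        System.out.println(uncapped.length() + " chars, truncated? "
                + looksTruncated(uncapped, raised));
    }
}
```

The same comparison could be applied to content.length() after the parseToString call above. If the length is well below the configured limit, the cap is unlikely to be the cause, and the PDF parsing itself becomes the more likely suspect.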
cheers
ehsan