Posted to user@tika.apache.org by "Allison A." <al...@gmail.com> on 2016/06/30 04:37:25 UTC

Re: PDFParser generates gibberish

I am running tika-server 1.13 to extract text from PDF files. Sometimes I
get gibberish characters between words; they seem to be inserted into the
spacing between words or appended at the end of the file.

For two-column PDF files the problem is much worse, and the output contains a lot of gibberish.

How can I get rid of this? Any suggestions are welcome.

Allison

RE: PDFParser generates gibberish

Posted by "Allison, Timothy B." <ta...@mitre.org>.
If you run the PDFBox app’s ExtractText on the files, do you get the same output? If so, it might make sense to ask the PDFBox project for help.

e.g.: http://apache.cs.utah.edu/pdfbox/2.0.2/pdfbox-app-2.0.2.jar

java -jar pdfbox-app-2.0.2.jar ExtractText thispdf.pdf
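
To make the comparison less subjective, one option is to count suspicious characters in each tool's output. The following is only a rough sketch, assuming the text from Tika and from ExtractText has already been saved to two UTF-8 files. The class name GibberishScan, the method countSuspicious, and the heuristic itself (control, private-use, and unassigned code points plus U+FFFD) are illustrative assumptions, not part of Tika or PDFBox, and the gibberish in a given file may well look different.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class GibberishScan {

    // Hypothetical heuristic: counts code points that rarely belong in
    // extracted prose, namely control characters other than newline,
    // carriage return and tab, private-use and unassigned code points,
    // and U+FFFD (the Unicode replacement character).
    public static int countSuspicious(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\n' || cp == '\r' || cp == '\t') {
                continue; // ordinary whitespace is expected in extracted text
            }
            int type = Character.getType(cp);
            if (type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || cp == 0xFFFD) {
                count++;
            }
        }
        return count;
    }

    // Prints one count per file named on the command line.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(name));
            String text = new String(bytes, StandardCharsets.UTF_8);
            System.out.println(name + ": " + countSuspicious(text)
                    + " suspicious code points");
        }
    }
}
```

Running it on both files gives one count per tool. If the counts are similar, that suggests the gibberish originates in PDFBox rather than in Tika.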
