Posted to user@tika.apache.org by "Allison A." <al...@gmail.com> on 2016/06/30 04:37:25 UTC
Re: PDFParser generates gibberish
I am running Tika-server-1.13 to extract text from a PDF file. Sometimes I
get gibberish characters between words; they seem to be added to the
spacing between words or at the end of the file.
For two-column PDF files this is quite serious: far too much gibberish is added.
How can I get rid of it? Any suggestions are welcome.
Allison
RE: PDFParser generates gibberish
Posted by "Allison, Timothy B." <ta...@mitre.org>.
If you run the PDFBox app’s ExtractText on the files, do you get the same output? If so, it might make sense to ask the PDFBox project for help.
For example, download http://apache.cs.utah.edu/pdfbox/2.0.2/pdfbox-app-2.0.2.jar and run:
java -jar pdfbox-app-2.0.2.jar ExtractText thispdf.pdf
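To judge whether the two tools really produce "the same output", it can help to ignore differences that come only from line breaks and spacing. The following is a minimal sketch of such a check, using only the Java standard library. The class name CompareExtracts and the idea of saving each tool's output to a text file first are assumptions made for illustration; nothing here is part of Tika or PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
class CompareExtracts {

    // Collapse every run of ASCII whitespace to a single space and trim the
    // ends, so that differing line breaks alone do not count as a mismatch.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Return the offset (in the normalized text) of the first character at
    // which the two inputs differ, or -1 if they are equal after normalizing.
    static int firstDifference(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int shared = Math.min(x.length(), y.length());
        for (int i = 0; i < shared; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return i;
            }
        }
        // One text is a prefix of the other, or they are identical.
        return x.length() == y.length() ? -1 : shared;
    }

    // Usage (assumed file names): java CompareExtracts tika.txt pdfbox.txt
    public static void main(String[] args) throws IOException {
        String first = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String second = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        int pos = firstDifference(first, second);
        if (pos < 0) {
            System.out.println("Outputs match after whitespace normalization");
        } else {
            System.out.println("Outputs first differ at normalized offset " + pos);
        }
    }
}
```

If the gibberish shows up in both outputs at the same place, the problem most likely lies in PDFBox's text extraction rather than in Tika itself, which is the case where asking the PDFBox project makes sense.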