Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2016/09/15 16:50:31 UTC
[Tika Wiki] Update of "Troubleshooting Tika" by TimothyAllison
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "Troubleshooting Tika" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/Troubleshooting%20Tika?action=diff&rev1=10&rev2=11
== PDF Text Problems ==
If Tika isn't extracting the right text from a PDF, or is giving errors, the first step is to determine whether the problem lies in Tika itself or in the underlying Apache PDFBox library.
- To check, grab the latest [[http://pdfbox.apache.org/download.cgi|Apache PDFBox pdfbox-app jar]] and use the [[http://pdfbox.apache.org/2.0/commandline.html#extracttext|ExtractText command line tool]] on your problematic PDF.
+ To check, grab the latest [[http://pdfbox.apache.org/download.cgi|Apache PDFBox pdfbox-app jar]] and use the [[http://pdfbox.apache.org/2.0/commandline.html#extracttext|ExtractText command line tool]] on your problematic PDF:
+ {{{
+ java -jar pdfbox-app.X.Y.jar ExtractText problematicPDF.pdf
+ }}}
If that shows the same problem, it's a PDFBox bug. Please [[http://pdfbox.apache.org/support.html|file an Apache PDFBox bug report]] and attach at least one failing file to the bug. Once the bug is fixed and released, Tika will pick up the fix when it upgrades to the new PDFBox release.
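To make the comparison easier, the check above can be wrapped in a small script that runs the same PDF through PDFBox alone and through Tika, saving both outputs for side-by-side inspection. This is only a rough sketch: the function name `compare_extraction`, the output file names, and the `tika-app-X.Y.jar` placeholder are assumptions made for illustration, and it assumes the PDFBox 2.0-style `ExtractText` command shown above plus the `--text` option of the tika-app jar. The two outputs will usually differ in whitespace even when both tools work correctly, so look for missing or garbled text rather than byte-for-byte equality.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika or PDFBox): extract text from one PDF
# with PDFBox alone and with Tika, writing each result to its own file.
# The jar names are placeholders; point them at the jars you downloaded.
PDFBOX_JAR="${PDFBOX_JAR:-pdfbox-app.X.Y.jar}"
TIKA_JAR="${TIKA_JAR:-tika-app-X.Y.jar}"

compare_extraction() {
    pdf="$1"

    # PDFBox 2.0-style CLI: ExtractText <inputfile> [output-text-file]
    java -jar "$PDFBOX_JAR" ExtractText "$pdf" pdfbox-out.txt || return 1

    # tika-app prints plain text to stdout when given --text
    java -jar "$TIKA_JAR" --text "$pdf" > tika-out.txt || return 1

    echo "PDFBox output: pdfbox-out.txt"
    echo "Tika output:   tika-out.txt"
}
```

After sourcing the script, run `compare_extraction problematicPDF.pdf` and compare the two files. If the problem appears in both, it most likely comes from PDFBox; if it appears only in `tika-out.txt`, it is more likely a Tika issue.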