Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2016/07/26 15:09:15 UTC

[Tika Wiki] Update of "Troubleshooting Tika" by NickBurch

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "Troubleshooting Tika" page has been changed by NickBurch:
https://wiki.apache.org/tika/Troubleshooting%20Tika?action=diff&rev1=8&rev2=9

Comment:
PDF text issues

   * Make sure Tika is able to correctly detect your file's type, see '''Content Incorrectly Detected'''
   * Make sure Tika used the parser you meant it to, see '''Wrong Parser Used'''
   * Make sure you're actually using the version of Tika you meant to use! See '''Identifying your Tika Version'''
+  * Problems with a PDF? See '''PDF Text Problems'''
  
  == No Content Extracted ==
   * Make sure Tika is able to correctly detect your file's type, see '''Content Incorrectly Detected'''
@@ -239, +240 @@

  
  ''TODO describe how to use a ServiceLoader.LoadErrorHandler.ERROR to trigger an exception''
  
+ == PDF Text Problems ==
+ If Tika isn't extracting the right text from a PDF, or is reporting errors, first work out whether the problem lies in Tika itself or in the underlying Apache PDFBox library that Tika uses to parse PDFs.
+ 
+ To check, download the latest [[http://pdfbox.apache.org/download.cgi|Apache PDFBox pdfbox-app jar]] and run the [[http://pdfbox.apache.org/2.0/commandline.html#extracttext|ExtractText command line tool]] on your problematic PDF.
+ 
+ If that shows the same problem, it's a PDFBox bug. Please [[http://pdfbox.apache.org/support.html|file an Apache PDFBox bug report]] and attach at least one failing file to the bug. Once it's fixed there, Tika will get the fix when it picks up the new PDFBox release.
+ 
+ If PDFBox ExtractText works fine, it's likely a Tika bug. Please [[http://tika.apache.org/contribute.html|report an Apache Tika bug]], attach at least one failing file, and mention that PDFBox ExtractText works fine.
+
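
The check described under "PDF Text Problems" can be sketched as a small Python helper. This is only an illustration, not part of the wiki page: the jar file names, the output file name, and the function names are hypothetical, and it assumes `java` is on the PATH, the PDFBox 2.0 `ExtractText <inputfile> [output-text-file]` syntax, and the tika-app `--text` option.

```python
import subprocess
from pathlib import Path


def extraction_commands(pdf, pdfbox_jar, tika_jar):
    """Build the two command lines to compare (jar paths are placeholders)."""
    pdfbox_out = pdf + ".pdfbox.txt"  # hypothetical output file name
    return {
        # PDFBox 2.0 syntax: ExtractText <inputfile> [output-text-file]
        "pdfbox": ["java", "-jar", pdfbox_jar, "ExtractText", pdf, pdfbox_out],
        # tika-app prints the extracted plain text to stdout with --text
        "tika": ["java", "-jar", tika_jar, "--text", pdf],
    }


def where_to_report(pdfbox_output_ok):
    """Same problem in PDFBox means a PDFBox bug; otherwise likely Tika."""
    return "Apache Tika" if pdfbox_output_ok else "Apache PDFBox"


def run_both(pdf, pdfbox_jar, tika_jar):
    """Run both tools and return (pdfbox_text, tika_text) for comparison."""
    cmds = extraction_commands(pdf, pdfbox_jar, tika_jar)
    subprocess.run(cmds["pdfbox"], check=True)
    pdfbox_text = Path(cmds["pdfbox"][-1]).read_text(
        encoding="utf-8", errors="replace")
    tika = subprocess.run(cmds["tika"], check=True, capture_output=True)
    return pdfbox_text, tika.stdout.decode("utf-8", errors="replace")
```

Whether the PDFBox output counts as "fine" is still a judgement made by reading it; `where_to_report` only encodes the rule for which project should receive the bug report once that judgement is made.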