Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/07 15:59:29 UTC

[Tika Wiki] Update of "TikaBatchUsage" by TimothyAllison

The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=6&rev2=7

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.
  
- The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]].  The current goal is to add this into Tika 1.8.
+ tika-batch is now available in trunk and will be available in Tika 1.8.
  
  == TikaBatch FileSystem (FS) ==
  For expert users who don't want to use tika-app or who might want to do custom extensions, there are example driver files and logging config files available in [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
@@ -39, +39 @@

  
        java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
  
-  *Use the RecursiveParserWrapper and store text for each document:
+  *Use the !RecursiveParserWrapper and store text for each document:
        java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>
  
   *Specify jvm args to be used by the child process (prepend a "J" to the regular args):
@@ -52, +52 @@

        
  === Some notes ===
  
-  *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=666.  If you want to kill all processing, make sure to kill the parent process and then the child process.
+  *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=254.  If you want to kill all processing, make sure to kill the parent process and then the child process.
  
-  *Because of a feature in javax's xml parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>").  That should go away when Tika migrates to PDFBox 2.0.
  
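The restart rule in the note above can be sketched roughly as follows. This is an illustrative sketch only, not tika-batch's actual watchdog code: the class `WatchdogSketch`, the methods `shouldRestart` and `superviseChild`, and the `maxRuns` cap are all hypothetical names introduced here. The only detail taken from the page is the "do not restart" exit value of 254. The child process is simulated with a supplier of exit values so the sketch runs without spawning anything.

```java
import java.util.function.IntSupplier;

public class WatchdogSketch {
    // Exit value that tells the parent not to restart the child (254, per the note above).
    static final int DO_NOT_RESTART_EXIT_VALUE = 254;

    // Hypothetical helper: any exit value other than 254 triggers a restart.
    static boolean shouldRestart(int exitValue) {
        return exitValue != DO_NOT_RESTART_EXIT_VALUE;
    }

    // Hypothetical supervision loop: runs the child until it exits with the
    // "do not restart" value or until maxRuns is reached (a safety cap that is
    // an assumption of this sketch). Returns how many times the child was run.
    static int superviseChild(IntSupplier runChildOnce, int maxRuns) {
        int runs = 0;
        while (runs < maxRuns) {
            int exitValue = runChildOnce.getAsInt();
            runs++;
            if (!shouldRestart(exitValue)) {
                break;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Simulated child: fails twice with exit value 1, then exits with 254.
        int[] simulatedExitValues = {1, 1, 254};
        int[] next = {0};
        int runs = superviseChild(() -> simulatedExitValues[next[0]++], 10);
        System.out.println("child runs: " + runs);
    }
}
```

In a real parent process, the supplier would presumably wrap something like `new ProcessBuilder(...).start().waitFor()`, with the checked exceptions handled. Because the parent restarts the child on any other exit value, killing only the child is not enough to stop processing, which is why the note says to kill the parent first.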
  
  == TikaBatch Server ==
@@ -64, +63 @@

  
  == TikaBatch Hadoop ==
  Module not yet implemented within Tika project...want to contribute?
+ See TikaInHadoop.
- Some external project links and blogs:
-  *[[http://svn.apache.org/repos/asf/oodt/trunk/crawler|Apache OODT Crawler]]
-  *[[https://github.com/DigitalPebble/behemoth|DigitalPebble]]
-  *[[http://openpreservation.org/knowledge/blogs/2014/03/21/tika-ride-characterising-web-content-nanite/|Nanite]]