Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/03/11 02:57:18 UTC

[Tika Wiki] Trivial Update of "TikaBatchUsage" by TimothyAllison

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=5&rev2=6

  = Usage =
  See TikaBatchOverview for a general design overview of tika-batch.
  
- This is all still very much in a dev state.  The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1302|here]].  The current goal is to get this into decent enough shape to make it into Tika 1.8.
+ The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]].  The current goal is to add this into Tika 1.8.
- 
  
  == TikaBatch FileSystem (FS) ==
  For expert users who don't want to use tika-app or who might want to do custom extensions, there are example driver files and logging config files available in [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
@@ -15, +14 @@

  You can see the commandline arguments via the regular "-?" or "--help" commands.  There is a separate section at the end for tika-batch options.
  
  In the current dev version, tika-app decides whether it is in batch mode based on one of two signals:
- 1. The final argument in the commandline args is a directory
+ 
+ 1. There are only two arguments and the first one is an existing directory
+ 
- 2. -srcDir is specified in the commandline
+ 2. -inputDir or -i is specified in the commandline
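The two signals in the updated list can be sketched as a small check. This is only an illustration under stated assumptions: the class name BatchModeSketch and the method looksLikeBatchMode are hypothetical and are not taken from the tika-app source, which may apply the rules differently.

```java
import java.io.File;

// Hypothetical helper for illustration only; not Tika's actual implementation.
class BatchModeSketch {

    // Returns true if the arguments match either of the two signals listed above.
    static boolean looksLikeBatchMode(String[] args) {
        // Signal 1: exactly two arguments and the first is an existing directory.
        if (args.length == 2 && new File(args[0]).isDirectory()) {
            return true;
        }
        // Signal 2: -inputDir or -i appears somewhere on the command line.
        for (String arg : args) {
            if (arg.equals("-inputDir") || arg.equals("-i")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBatchMode(args));
    }
}
```

Under these assumptions, a single file argument or a set of regular tika-app flags without -i or -inputDir would not be treated as batch mode.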
  
  Once the app knows that it is in batch mode, it converts some of the traditional tika-app commandline arguments for use by org.apache.tika.batch.fs.FSBatchProcessCLI.
  
@@ -24, +25 @@

  
   *Most basic (input directory followed by output directory):
  
-       java -jar tika-app.X.Y.jar <inputDirectory>
+       java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>
+ 
+  *Specify input and output directories:
+ 
+       java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir
  
   *Set the number of file consumer threads:
  
-       java -jar tika-app.X.Y.jar -numConsumers 10 <inputDirectory>
+       java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o <outputDirectory>
  
-  *Specify input and output directories:
+  *Output text instead of XML:
  
-       java -jar tika-app.X.Y.jar -srcDir /mydata/src/dir -targDir /mydata/output/dir
+       java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
+ 
+  *Use the RecursiveParserWrapper and store text for each document:
+       java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>
  
   *Specify jvm args to be used by the child process (prepend a "J" to the regular args):
  
-       java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} <inputDirectory>
+       java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} -i <inputDirectory> -o <outputDirectory>
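The "prepend a J" convention for child-process JVM args can be pictured as a simple rewrite of each argument. The sketch below is an assumption about how such a conversion might look, not the code tika-app uses; the class ChildJvmArgsSketch and the method childJvmArgs are hypothetical names. It also assumes that a bare -J is not a JVM arg, since the examples above use -J on its own for the RecursiveParserWrapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not Tika's actual implementation.
class ChildJvmArgsSketch {

    // Collects arguments such as -JXmx2g and returns them in regular form (-Xmx2g).
    static List<String> childJvmArgs(String[] args) {
        List<String> jvmArgs = new ArrayList<>();
        for (String arg : args) {
            // Assumption: a bare "-J" is a separate flag, so only longer -J... args are converted.
            if (arg.startsWith("-J") && arg.length() > 2) {
                jvmArgs.add("-" + arg.substring(2));
            }
        }
        return jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(childJvmArgs(args));
    }
}
```

With this sketch, -JXmx2g becomes -Xmx2g and -JDlog4j.configuration=file:bin/log4j.xml becomes -Dlog4j.configuration=file:bin/log4j.xml, while -i, -o and a bare -J are ignored.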
  
   *Commandline to generate output files for tika-eval...only process those files listed in pdfs_random_50000.csv:
-       java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -targDir <targDir> -srcDir <srcDir> -fileList pdfs_random_50000.csv
+       java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -o <outputDirectory> -i <inputDirectory> -fileList pdfs_random_50000.csv
- 
  
        
+ === Some notes ===
+ 
+  *The watchdog process will restart the child process unless the child process exits with the "do not restart" value (666).  To stop all processing, kill the parent process first and then the child process.
+ 
+  *Because of a feature in javax's xml parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>").  That should go away when Tika migrates to PDFBox 2.0.
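The restart rule in the first note above can be sketched as a loop around the child process. This is a simplified model under stated assumptions, not the tika-batch driver: the class WatchdogSketch, the method superviseChild, and the maxStarts cap are hypothetical, and the child is modelled as a function returning an exit value rather than a real operating-system process. Only the value 666 comes from the note.

```java
import java.util.function.IntSupplier;

// Hypothetical model for illustration only; not Tika's actual watchdog.
class WatchdogSketch {

    // "Do not restart" value quoted in the note above.
    static final int DO_NOT_RESTART = 666;

    // Runs the child until it reports DO_NOT_RESTART or maxStarts is reached.
    // Returns how many times the child was started.
    static int superviseChild(IntSupplier runChildOnce, int maxStarts) {
        int starts = 0;
        while (starts < maxStarts) {
            starts++;
            int exitValue = runChildOnce.getAsInt();
            if (exitValue == DO_NOT_RESTART) {
                break; // the child asked not to be restarted
            }
            // any other exit value: loop around and start the child again
        }
        return starts;
    }

    public static void main(String[] args) {
        // Simulated child: fails twice with exit value 1, then asks not to be restarted.
        int[] exits = {1, 1, DO_NOT_RESTART};
        int[] next = {0};
        int starts = superviseChild(() -> exits[next[0]++], 10);
        System.out.println("child started " + starts + " times");
    }
}
```

The loop also shows why killing only the child is not enough to stop processing: any exit other than the "do not restart" value leads the parent to start a new child.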
+ 
  
  == TikaBatch Server ==
  Module not yet implemented...want to contribute?