Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/03/11 02:57:18 UTC
[Tika Wiki] Trivial Update of "TikaBatchUsage" by TimothyAllison
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=5&rev2=6
= Usage =
See TikaBatchOverview for a general design overview of tika-batch.
- This is all still very much in a dev state. The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1302|here]]. The current goal is to get this into decent enough shape to make it into Tika 1.8.
+ The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]]. The current goal is to add this into Tika 1.8.
-
== TikaBatch FileSystem (FS) ==
For expert users who don't want to use tika-app or who want to write custom extensions, example driver files and logging config files are available [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
@@ -15, +14 @@
You can see the commandline arguments via the regular "-?" or "--help" commands. There is a separate section at the end for tika-batch options.
In the current dev version, tika-app decides that it is in batch mode based on one of two signals:
- 1. The final argument in the commandline args is a directory
+
+ 1. There are only two arguments and the first one is an existing directory
+
- 2. -srcDir is specified in the commandline
+ 2. -inputDir or -i is specified in the commandline
Once the app knows that it is in batch mode, it converts some of the traditional tika-app commandline arguments for use by org.apache.tika.batch.fs.FSBatchProcessCLI.
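
The two signals above can be sketched as a small check over the commandline arguments. This is only an illustration under stated assumptions, not the tika-app source: the class name BatchModeSketch and the method isBatchMode are hypothetical, and the real detection logic may differ in detail.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

public class BatchModeSketch {

    // Hypothetical helper (not part of tika-app): mirrors the two signals
    // described on this page.
    public static boolean isBatchMode(String[] args) {
        // Signal 2: -inputDir or -i is specified anywhere in the commandline.
        for (String arg : args) {
            if (arg.equals("-inputDir") || arg.equals("-i")) {
                return true;
            }
        }
        // Signal 1: there are only two arguments and the first one is an
        // existing directory.
        return args.length == 2 && new File(args[0]).isDirectory();
    }

    public static void main(String[] args) throws Exception {
        // Create a real directory so that signal 1 can be exercised.
        Path tmp = Files.createTempDirectory("batch-sketch");
        System.out.println(isBatchMode(new String[] {tmp.toString(), "output"}));
        System.out.println(isBatchMode(new String[] {"-i", "in", "-o", "out"}));
        System.out.println(isBatchMode(new String[] {"-t", "some-file.pdf"}));
        Files.delete(tmp);
    }
}
```

Checking the explicit flag first keeps the directory test from touching the file system when the caller has already asked for batch mode.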
@@ -24, +25 @@
*Most basic (with output to a directory called "output"):
- java -jar tika-app.X.Y.jar <inputDirectory>
+ java -jar tika-app.X.Y.jar <inputDirectory> <outputDirectory>
+
+ *Specify input and output directories:
+
+ java -jar tika-app.X.Y.jar -i /mydata/src/dir -o /mydata/output/dir
*Set the number of file consumer threads:
- java -jar tika-app.X.Y.jar -numConsumers 10 <inputDirectory>
+ java -jar tika-app.X.Y.jar -numConsumers 10 -i <inputDirectory> -o <outputDirectory>
- *Specify input and output directories:
+ *Output text instead of XML:
- java -jar tika-app.X.Y.jar -srcDir /mydata/src/dir -targDir /mydata/output/dir
+ java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
+
+ *Use the RecursiveParserWrapper and store text for each document:
+ java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>
*Specify jvm args to be used by the child process (prepend a "J" to the regular args):
- java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} <inputDirectory>
+ java -jar tika-app.X.Y.jar -JXmx2g -JDlog4j.configuration={{file:bin/log4j.xml}} -i <inputDirectory> -o <outputDirectory>
*Commandline to generate output files for tika-eval...only process those files listed in pdfs_random_50000.csv:
- java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -targDir <targDir> -srcDir <srcDir> -fileList pdfs_random_50000.csv
+ java -Dlog4j.configuration=file:bin/log4j_driver.xml -jar tika-app-X.Y.jar -JXmx6g -JDlog4j.configuration=file:bin/log4j.xml -bc tika-batch-config-basic-test.xml -numConsumers 10 -o <outputDirectory> -i <inputDirectory> -fileList pdfs_random_50000.csv
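
As a rough illustration of the "prepend a J" convention in the examples above, the following sketch pulls the child-process JVM arguments out of a tika-app commandline. The class ChildJvmArgsSketch and the method childJvmArgs are hypothetical names chosen for this example; how tika-app actually performs the conversion is an assumption here. A bare -J is skipped because the examples above use it to select the RecursiveParserWrapper.

```java
import java.util.ArrayList;
import java.util.List;

public class ChildJvmArgsSketch {

    // Hypothetical helper (not part of tika-app): rewrites every argument of
    // the form -J<rest> as -<rest> and ignores everything else.
    public static List<String> childJvmArgs(String[] args) {
        List<String> jvmArgs = new ArrayList<>();
        for (String arg : args) {
            // A bare "-J" carries no JVM option, so require something after it.
            if (arg.startsWith("-J") && arg.length() > 2) {
                jvmArgs.add("-" + arg.substring(2));
            }
        }
        return jvmArgs;
    }

    public static void main(String[] args) {
        String[] example = {
            "-JXmx2g",
            "-JDlog4j.configuration=file:bin/log4j.xml",
            "-J", "-t", "-i", "in", "-o", "out"
        };
        // Print one child JVM argument per line.
        for (String jvmArg : childJvmArgs(example)) {
            System.out.println(jvmArg);
        }
    }
}
```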
-
+ === Some notes ===
+
+ *The watchdog process will restart the child process unless the child process exits with the "do not restart" value (666). If you want to stop all processing, kill the parent process first and then the child process.
+
+ *Because of a feature in javax's XML parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>"). These messages should go away when Tika migrates to PDFBox 2.0.
+
== TikaBatch Server ==
Module not yet implemented...want to contribute?