Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/07 15:59:29 UTC
[Tika Wiki] Update of "TikaBatchUsage" by TimothyAllison
The "TikaBatchUsage" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaBatchUsage?action=diff&rev1=6&rev2=7
= Usage =
See TikaBatchOverview for a general design overview of tika-batch.
- The code is currently available [[https://github.com/tballison/tika/tree/TIKA-1330|here]]. The current goal is to add this into Tika 1.8.
+ tika-batch is now available in trunk and will be available in Tika 1.8.
== TikaBatch FileSystem (FS) ==
For expert users who don't want to use tika-app or who want to write custom extensions, example driver files and logging config files are available [[https://github.com/tballison/tika/tree/TIKA-1302/tika-batch/src/main/examples|here]].
@@ -39, +39 @@
java -jar tika-app.X.Y.jar -t -i <inputDirectory> -o <outputDirectory>
- *Use the RecursiveParserWrapper and store text for each document:
+ *Use the !RecursiveParserWrapper and store text for each document:
java -jar tika-app.X.Y.jar -J -t -i <inputDirectory> -o <outputDirectory>
*Specify JVM args to be used by the child process (prepend a "J" to the regular args):
@@ -52, +52 @@
=== Some notes ===
- *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=666. If you want to kill all processing, make sure to kill the parent process and then the child process.
+ *The watchdog process will restart the child process unless the child process exits with a "do not restart value"=254. If you want to kill all processing, make sure to kill the parent process and then the child process.
- *Because of a feature in javax's xml parser and the way the parser is configured in PDFBox 1.8.8, it is common to see error messages (e.g. Fatal Error :21:9: Element type "pdfx:" must be followed by either attribute specifications, ">" or "/>"). That should go away when Tika migrates to PDFBox 2.0.
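The restart rule in the first note can be sketched in a few lines of Java. This is only an illustration of the behavior as described, not tika-batch's actual code: the class name WatchdogSketch, the methods shouldRestart and superviseChild, and the maxStarts cap are all hypothetical, and the child process is simulated by a function that returns an exit value. Only the value 254 comes from the note.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the watchdog rule; none of these names exist in tika-batch.
final class WatchdogSketch {
    // Exit value that the note above calls the "do not restart value".
    static final int DO_NOT_RESTART_EXIT_VALUE = 254;

    // Per the note, every exit value other than 254 leads to a restart.
    static boolean shouldRestart(int exitCode) {
        return exitCode != DO_NOT_RESTART_EXIT_VALUE;
    }

    // Runs the (simulated) child until it returns 254 or maxStarts is reached.
    // maxStarts is an assumption added so the sketch always terminates.
    // Returns how many times the child was started.
    static int superviseChild(IntSupplier childRun, int maxStarts) {
        int starts = 0;
        while (starts < maxStarts) {
            starts++;
            int exitCode = childRun.getAsInt();
            if (!shouldRestart(exitCode)) {
                break;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Simulated child: exits with 1, then 0, then the do-not-restart value.
        int[] codes = {1, 0, 254};
        int[] next = {0};
        int starts = superviseChild(() -> codes[next[0]++], 10);
        System.out.println("child started " + starts + " time(s)");
    }
}
```

In a real driver the IntSupplier would presumably wrap something like ProcessBuilder.start() followed by Process.waitFor(). The loop also shows why the parent should be killed first: as long as the watchdog is alive, a child that is killed without returning 254 would simply be started again.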
== TikaBatch Server ==
@@ -64, +63 @@
== TikaBatch Hadoop ==
Module not yet implemented within Tika project...want to contribute?
+ See TikaInHadoop.
- Some external project links and blogs:
- *[[http://svn.apache.org/repos/asf/oodt/trunk/crawler|Apache OODT Crawler]]
- *[[https://github.com/DigitalPebble/behemoth|DigitalPebble]]
- *[[http://openpreservation.org/knowledge/blogs/2014/03/21/tika-ride-characterising-web-content-nanite/|Nanite]]