Posted to commits@tika.apache.org by ni...@apache.org on 2015/08/19 16:50:11 UTC
svn commit: r1696602 - /tika/site/src/site/apt/1.10/configuring.apt
Author: nick
Date: Wed Aug 19 14:50:11 2015
New Revision: 1696602
URL: http://svn.apache.org/r1696602
Log:
Restore the 1.10 configuration page, which seems to have got clobbered with the 1.9 one during the release
Modified:
tika/site/src/site/apt/1.10/configuring.apt
Modified: tika/site/src/site/apt/1.10/configuring.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.10/configuring.apt?rev=1696602&r1=1696601&r2=1696602&view=diff
==============================================================================
--- tika/site/src/site/apt/1.10/configuring.apt (original)
+++ tika/site/src/site/apt/1.10/configuring.apt Wed Aug 19 14:50:11 2015
@@ -31,21 +31,33 @@ Configuring Tika
* {Configuring Parsers}
-~~ TODO Add more on in 1.10, which has more support
+ Through the Tika Config xml, it is possible to have a high degree of control
+ over which parsers are or aren't used, in what order of preferences etc. It
+ is also possible to override just certain parts, to (for example) have "default
+ except for PDF".
+
+ Currently, it is only possible to have a single parser run against a document.
+ There is on-going discussion around fallback parsers and combining the output
+ of multiple parsers running on a document, but none of these are available yet.
+
+ To override certain default parser behaviours, include the {{{ DefaultParser }}}
+ in your configuration, with excludes, then add other parser definitions in.
+ To prevent the {{{ DefaultParser }}} (with its auto-discovery) being used,
+ simply omit it from your config, and list all other parsers you want instead.
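As a minimal sketch of the second case, a config with no DefaultParser entry
might look something like the following. The two parsers listed are only an
illustrative choice, not a recommended set; substitute whichever parser
classes you actually want used.

---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- No DefaultParser entry, so no auto-discovery takes place;
         only the parsers listed here are used (illustrative choice) -->
    <parser class="org.apache.tika.parser.txt.TXTParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser"/>
  </parsers>
</properties>
---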
- In Tika 1.9, there is some support for configuring Parsers in the Tika Config
- xml. You can provide a custom list of parser to use, in a custom order, and you
- can also force certain mimetypes to be used or not-used for parsers. You can do
- so with Tika Config something like:
+ To override just some default behaviour, you can use a Tika Config something
+ like this:
---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
- <!-- Default Parser for most things, except for 2 mime types -->
+ <!-- Default Parser for most things, except for 2 mime types, and never
+ use the Executable Parser -->
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/pdf</mime-exclude>
+ <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
</parser>
<!-- Use a different parser for PDF -->
<parser class="org.apache.tika.parser.EmptyParser">
@@ -55,8 +67,8 @@ Configuring Tika
</properties>
---
- In code, the key classes to use to build up your own custom parser
- heirarchy are
+ To configure things in code, the key classes to use to build up your own custom
+ parser hierarchy are
{{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
{{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}}
and
@@ -64,11 +76,35 @@ Configuring Tika
* {Configuring Detectors}
-~~ TODO Add more on in 1.10, which has more support
+ Through the Tika Config xml, it is possible to have a high degree of control
+ over which detectors are or aren't used, in what order of preferences etc. It
+ is also possible to override just certain parts, to (for example) have "default
+ except for POIFS Container Detection".
+
+ To override certain default detector behaviours, include the
+ {{{ DefaultDetector }}}, with any {{{ detector-exclude }}} entries you need,
+ in your configuration, then add other detector definitions in. To prevent
+ the {{{ DefaultDetector }}} (with its auto-discovery) being used, simply omit it
+ from your config, and list all other detectors you want instead.
- In Tika 1.9, there is limited support for configuring Detectors in the Tika Config
- xml. You can provide a custom list of detectors to use, in a custom order, with
- Tika Config something like:
+ To override just some default behaviour, you can use a Tika Config something
+ like this:
+
+---
+<?xml version="1.0" encoding="UTF-8"?>
+<properties>
+ <detectors>
+ <!-- All detectors except built-in container ones -->
+ <detector class="org.apache.tika.detect.DefaultDetector">
+ <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
+ <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
+ </detector>
+ </detectors>
+</properties>
+---
+
+ Or, to use only certain detectors, you can use a Tika Config something
+ like this:
---
<?xml version="1.0" encoding="UTF-8"?>
@@ -103,6 +139,9 @@ Configuring Tika
While the work on that is ongoing, for now you will need to review the
{{{./api/}Tika Javadocs}} to see how individual Translators are configured.
+~~ When Translators can have their parameters configured, mention here about
+~~ specifying which single one to use in the Tika Config XML
+
* {Using a Tika Configuration XML file}
However you call Tika, the System Property of <<< tika.config >>> is