Posted to commits@tika.apache.org by ni...@apache.org on 2015/08/19 16:50:11 UTC

svn commit: r1696602 - /tika/site/src/site/apt/1.10/configuring.apt

Author: nick
Date: Wed Aug 19 14:50:11 2015
New Revision: 1696602

URL: http://svn.apache.org/r1696602
Log:
Restore the 1.10 configuration page, which seems to have got clobbered with the 1.9 one during the release

Modified:
    tika/site/src/site/apt/1.10/configuring.apt

Modified: tika/site/src/site/apt/1.10/configuring.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.10/configuring.apt?rev=1696602&r1=1696601&r2=1696602&view=diff
==============================================================================
--- tika/site/src/site/apt/1.10/configuring.apt (original)
+++ tika/site/src/site/apt/1.10/configuring.apt Wed Aug 19 14:50:11 2015
@@ -31,21 +31,33 @@ Configuring Tika
 
 * {Configuring Parsers}
 
-~~ TODO Add more on in 1.10, which has more support
+    Through the Tika Config xml, it is possible to have a high degree of control
+    over which parsers are or aren't used, in what order of preference, etc. It
+    is also possible to override just certain parts, to (for example) have "default
+    except for PDF".
+
+    Currently, it is only possible to have a single parser run against a document.
+    There is on-going discussion around fallback parsers and combining the output
+    of multiple parsers running on a document, but none of these are available yet.
+
+    To override certain default parser behaviours, include the {{{ DefaultParser }}}
+    in your configuration, with excludes, then add other parser definitions in.
+    To prevent the {{{ DefaultParser }}} (with its auto-discovery) being used, 
+    simply omit it from your config, and list all other parsers you want instead.
 
-    In Tika 1.9, there is some support for configuring Parsers in the Tika Config 
-    xml. You can provide a custom list of parser to use, in a custom order, and you
-    can also force certain mimetypes to be used or not-used for parsers. You can do
-    so with Tika Config something like:
+    To override just some default behaviour, you can use a Tika Config something
+    like this:
 
 ---
 <?xml version="1.0" encoding="UTF-8"?>
 <properties>
   <parsers>
-    <!-- Default Parser for most things, except for 2 mime types -->
+    <!-- Default Parser for most things, except for 2 mime types, and never
+         use the Executable Parser -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>image/jpeg</mime-exclude>
       <mime-exclude>application/pdf</mime-exclude>
+      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
     </parser>
     <!-- Use a different parser for PDF -->
     <parser class="org.apache.tika.parser.EmptyParser">
@@ -55,8 +67,8 @@ Configuring Tika
 </properties>
 ---
 
-    In code, the key classes to use to build up your own custom parser
-    heirarchy are 
+    To configure things in code, the key classes to use to build up your own custom 
+    parser hierarchy are
     {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
     {{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}}
     and
@@ -64,11 +76,35 @@ Configuring Tika
 
 * {Configuring Detectors}
 
-~~ TODO Add more on in 1.10, which has more support
+    Through the Tika Config xml, it is possible to have a high degree of control
+    over which detectors are or aren't used, in what order of preference, etc. It
+    is also possible to override just certain parts, to (for example) have "default
+    except for POIFS Container Detection".
+
+    To override certain default detector behaviours, include the
+    {{{ DefaultDetector }}}, with any {{{ detector-exclude }}} entries you need,
+    in your configuration, then add other detector definitions in. To prevent
+    the {{{ DefaultDetector }}} (with its auto-discovery) being used, simply omit it
+    from your config, and list all other detectors you want instead.
 
-    In Tika 1.9, there is limited support for configuring Detectors in the Tika Config 
-    xml. You can provide a custom list of detectors to use, in a custom order, with
-    Tika Config something like:
+    To override just some default behaviour, you can use a Tika Config something
+    like this:
+
+---
+<?xml version="1.0" encoding="UTF-8"?>
+<properties>
+  <detectors>
+    <!-- All detectors except built-in container ones -->
+    <detector class="org.apache.tika.detect.DefaultDetector">
+      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
+      <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
+    </detector>
+  </detectors>
+</properties>
+---
+
+    Or, to use only certain detectors, you can use a Tika Config something
+    like this:
 
 ---
 <?xml version="1.0" encoding="UTF-8"?>
@@ -103,6 +139,9 @@ Configuring Tika
     While the work on that is ongoing, for now you will need to review the
     {{{./api/}Tika Javadocs}} to see how individual Translators are configured.
 
+~~ When Translators can have their parameters configured, mention here about
+~~ specifying which single one to use in the Tika Config XML
+
 * {Using a Tika Configuration XML file}
 
     However you call Tika, the System Property of <<< tika.config >>> is