Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/01/02 19:11:20 UTC

[Tika Wiki] Update of "CompositeParserDiscussion" by NickBurch

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=7&rev2=8

Comment:
Multiple Parser multiple-metadata discussions from the mailing list

  
  == Supplementary/Additive ==
  Concatenate the results (metadata and content) for several parsers
+ 
+ For Metadata, when multiple parsers output the same Metadata Key, the merge behaviour should be configurable. The options are:
+  * First/earliest parser's value(s) win
+  * Last/latest parser's value(s) win
+  * Capture all and return multiple values
+ 
+ ''TODO'' For Content, decide how we could support appending or resetting of the SAX stream
  
  We need a better name for this!
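
The three metadata options listed above can be illustrated with a minimal sketch. It does not use any Tika classes: metadata is modelled as a plain `Map<String, List<String>>`, and the class name `MetadataMergeSketch`, the enum `MetadataPolicy` and the method `merge` are all hypothetical names chosen for this example. The enum constants simply mirror the `metadataPolicy` values proposed in the config. Parsers are assumed to be merged in the order they ran.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataMergeSketch {

    /** Hypothetical policy names, mirroring the proposed metadataPolicy values. */
    enum MetadataPolicy { FIRST_WINS, LAST_WINS, KEEP_ALL }

    /**
     * Merges one parser's metadata into the combined result.
     * Assumption: callers invoke this once per parser, in execution order.
     */
    static void merge(Map<String, List<String>> combined,
                      Map<String, List<String>> fromParser,
                      MetadataPolicy policy) {
        for (Map.Entry<String, List<String>> entry : fromParser.entrySet()) {
            String key = entry.getKey();
            List<String> values = entry.getValue();
            switch (policy) {
                case FIRST_WINS:
                    // Keep the value(s) from the earliest parser that set this key
                    combined.putIfAbsent(key, new ArrayList<>(values));
                    break;
                case LAST_WINS:
                    // Overwrite with the value(s) from the latest parser
                    combined.put(key, new ArrayList<>(values));
                    break;
                case KEEP_ALL:
                    // Append, so the key ends up with values from every parser
                    combined.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        // Two parsers that both report the same key
        Map<String, List<String>> first = Map.of("title", List.of("From first parser"));
        Map<String, List<String>> second = Map.of("title", List.of("From second parser"));

        for (MetadataPolicy policy : MetadataPolicy.values()) {
            Map<String, List<String>> combined = new LinkedHashMap<>();
            merge(combined, first, policy);
            merge(combined, second, policy);
            System.out.println(policy + " -> " + combined);
        }
    }
}
```

Keys reported by only one parser pass through unchanged under every policy; the policies differ only when two parsers collide on the same key. How this would map onto Tika's own Metadata object is left open here, as is the Content/SAX question in the TODO.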
  
@@ -79, +86 @@

  
      <!-- JPEG needs special handling - try+combine everything -->
      <parser class="org.apache.tika.parser.(supplement)">
+        <params>
+           <!-- If several parsers output the same metadata key, first parser to do so wins -->
+           <param name="metadataPolicy" value="FIRST_WINS" />
+           <!-- If several parsers output the same metadata key, last parser to do so wins -->
+           <!--
+           <param name="metadataPolicy" value="LAST_WINS" />
+            -->
+           <!-- If several parsers output the same metadata key, store all their values -->
+           <!--
+           <param name="metadataPolicy" value="KEEP_ALL" />
+            -->
+        </params>
         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
         <parser class="org.apache.tika.parser.image.ImageParser" />
         <parser class="org.apache.tika.parser.jpeg.JpegParser" />