Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2018/01/02 19:11:20 UTC
[Tika Wiki] Update of "CompositeParserDiscussion" by NickBurch
The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=7&rev2=8
Comment:
Multiple Parser multiple-metadata discussions from the mailing list
== Supplementary/Additive ==
Concatenate the results (metadata and content) from several parsers
+
+ For Metadata, when multiple parsers output the same Metadata Key, the merge behaviour should be configurable between:
+ * First/earliest parser's value(s) win
+ * Last/latest parser's value(s) win
+ * Capture all and return multiple values
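
The three merge options above could be sketched roughly as follows. This is only an illustration, not Tika code: the names MetadataPolicy and MetadataMerger are hypothetical, and plain maps of key to value lists stand in for Tika's Metadata object. It assumes each parser's metadata is collected separately and then merged in parser execution order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical policy names, mirroring the three options listed above. */
enum MetadataPolicy { FIRST_WINS, LAST_WINS, KEEP_ALL }

/** Hypothetical helper; plain maps stand in for Tika's Metadata object. */
final class MetadataMerger {
    private MetadataMerger() {}

    /**
     * Merges per-parser metadata, supplied in parser execution order,
     * resolving clashes on the same key according to the given policy.
     */
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> perParser, MetadataPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> parserOutput : perParser) {
            for (Map.Entry<String, List<String>> e : parserOutput.entrySet()) {
                String key = e.getKey();
                List<String> values = e.getValue();
                switch (policy) {
                    case FIRST_WINS:
                        // Keep whatever the earliest parser stored for this key
                        merged.putIfAbsent(key, new ArrayList<>(values));
                        break;
                    case LAST_WINS:
                        // Later parsers overwrite earlier values
                        merged.put(key, new ArrayList<>(values));
                        break;
                    case KEEP_ALL:
                        // Accumulate every parser's values, in order
                        merged.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values);
                        break;
                }
            }
        }
        return merged;
    }
}
```

With two parsers that both emit the same key, FIRST_WINS keeps only the first parser's value, LAST_WINS keeps only the second's, and KEEP_ALL returns both values in parser order.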
+
+ ''TODO'' For Content, decide how we could support appending or resetting of the SAX stream
We need a better name for this!
@@ -79, +86 @@
<!-- JPEG needs special handling - try+combine everything -->
<parser class="org.apache.tika.parser.(supplement)">
+ <params>
+ <!-- If several parsers output the same metadata key, first parser to do so wins -->
+ <param name="metadataPolicy" value="FIRST_WINS" />
+ <!-- If several parsers output the same metadata key, last parser to do so wins -->
+ <!--
+ <param name="metadataPolicy" value="LAST_WINS" />
+ -->
+ <!-- If several parsers output the same metadata key, store all their values -->
+ <!--
+ <param name="metadataPolicy" value="KEEP_ALL" />
+ -->
+ </params>
<parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
<parser class="org.apache.tika.parser.image.ImageParser" />
<parser class="org.apache.tika.parser.jpeg.JpegParser" />