Posted to dev@tika.apache.org by "Lee Carpenter (JIRA)" <ji...@apache.org> on 2018/06/26 14:02:01 UTC
[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers
[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523768#comment-16523768 ]
Lee Carpenter commented on TIKA-1509:
-------------------------------------
Regarding Luis' point about resetting the stream, I have a perfect test case. An Excel macro-enabled template (application/vnd.ms-excel.template.macroenabled.12) would use org.apache.tika.parser.microsoft.ooxml.OOXMLParser, but if you look at the stream, the "magic" matches application/msexcel, which would use org.apache.tika.parser.microsoft.OfficeParser.
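To illustrate the kind of check involved, here is a minimal sketch of sniffing the leading bytes of a stream and then resetting it so a parser can still read from the start. It is not Tika's detector; the class and method names (MagicSniffSketch, sniffContainer) are made up for this example. The only facts it relies on are the OLE2 compound document signature and the ZIP local file header signature.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Tika.
final class MagicSniffSketch {
    // OLE2 compound document signature (container used by legacy Office formats).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (OOXML packages are ZIP archives).
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "ole2", "zip" or "unknown" and leaves the stream where it started. */
    static String sniffContainer(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2.length);
        byte[] head = new byte[OLE2.length];
        int n = 0;
        try {
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        } finally {
            in.reset(); // rewind so the chosen parser sees the whole stream
        }
        if (startsWith(head, n, OLE2)) return "ole2";
        if (startsWith(head, n, ZIP)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Note that this only identifies the container. Many different Office file types share the same leading bytes, so a check like this cannot by itself decide which specific type, and therefore which parser, applies.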
The file was an email attachment. I was able to handle a "fallback" myself, but that requires re-sending the whole document to be parsed again, so I would be interested in being able to implement a fallback parser. The one snag is that the "magic" is the same for a number of different MS Office file types, so detection relies on a valid content type or file extension.
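As a rough sketch of what I mean by a fallback parser: buffer the bytes once, then hand each candidate parser a fresh stream over the same bytes until one succeeds. The StreamParser interface and the FallbackSketch class below are hypothetical stand-ins made up for this example, not Tika's Parser API, and a real implementation would probably want to spool large documents to a temporary file rather than hold them in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical sketch; names are illustrative and not part of Tika.
final class FallbackSketch {
    /** Stand-in for a parser: turns a stream into extracted text or throws. */
    interface StreamParser {
        String parse(InputStream in) throws Exception;
    }

    /**
     * Tries each parser in order against the same bytes. Every attempt gets
     * its own stream, so a failed parser cannot leave the input half-consumed.
     */
    static String parseWithFallback(InputStream in, List<StreamParser> parsers) throws Exception {
        byte[] bytes = readAll(in); // assumption: the document fits in memory
        Exception last = null;
        for (StreamParser parser : parsers) {
            try {
                return parser.parse(new ByteArrayInputStream(bytes));
            } catch (Exception e) {
                last = e; // remember the failure and try the next parser
            }
        }
        if (last != null) {
            throw last; // every parser failed; surface the last error
        }
        throw new IllegalArgumentException("no parsers configured");
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return out.toByteArray();
    }
}
```

With something like this, the caller sends the document once and the retry happens next to the parsers, instead of the client re-sending the whole attachment after the first parser fails.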
Just some thoughts
> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
> Key: TIKA-1509
> URL: https://issues.apache.org/jira/browse/TIKA-1509
> Project: Tika
> Issue Type: Sub-task
> Reporter: Tim Allison
> Priority: Major
>
> Several parsers can handle the same mime type, and we are currently ordering which parser is chosen (roughly) by the alphabetic order of the parser class name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here: http://wiki.apache.org/tika/CompositeParserDiscussion
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)