Posted to dev@tika.apache.org by "Lee Carpenter (JIRA)" <ji...@apache.org> on 2018/06/26 14:02:01 UTC

[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

    [ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523768#comment-16523768 ] 

Lee Carpenter commented on TIKA-1509:
-------------------------------------

On Luis' stream-reset point, I have a perfect test case. An Excel macro-enabled template declared as application/vnd.ms-excel.template.macroenabled.12 would be handled by org.apache.tika.parser.microsoft.ooxml.OOXMLParser, but if you look at the stream itself, the "magic" matches application/msexcel, which would be handled by org.apache.tika.parser.microsoft.OfficeParser.
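
To make the mismatch concrete, here is a minimal sketch of comparing a declared content type against the leading bytes of the stream. It assumes the disagreement is between a ZIP-based OOXML container and an OLE2 container, which is my reading of this case and not something confirmed above. The class MagicSketch and its methods are hypothetical and are not Tika's detector API; only the two byte signatures are standard.

```java
public class MagicSketch {

    // OLE2 compound-file signature, shared by legacy .doc/.xls/.ppt files.
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local-file-header signature; OOXML files are ZIP archives.
    static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };

    /** Classifies the container format from the first bytes of a stream. */
    public static String containerOf(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    /** Hypothetical two-entry mapping from declared type to expected container. */
    public static String expectedContainer(String declaredType) {
        if ("application/vnd.ms-excel.template.macroenabled.12".equals(declaredType)) {
            return "zip";
        }
        if ("application/vnd.ms-excel".equals(declaredType)) {
            return "ole2";
        }
        return "unknown";
    }

    /** True when the declared type implies a container the bytes do not match. */
    public static boolean declaredTypeDisagrees(String declaredType, byte[] head) {
        String expected = expectedContainer(declaredType);
        return !expected.equals("unknown") && !expected.equals(containerOf(head));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A check like this can only flag that the declared type and the bytes disagree. It cannot say which Office format an OLE2 file actually is, because the signature is the same for all of them.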

The file was an email attachment. I was able to handle a "fallback" on my side, but that requires re-sending the whole document to be parsed again, so I would be interested in being able to implement a fallback parser. The one snag is that the "magic" is the same for a number of different MS Office file types, so detection for them relies on a valid content type or file extension.
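
As a rough illustration of the fallback idea, the sketch below marks the stream, tries a primary parser, and on failure resets the stream and hands it to a second parser, so the document does not have to be re-sent. It uses only java.io; SimpleParser and parseWithFallback are hypothetical names, not Tika's Parser interface or any existing Tika API. It also assumes the primary parser fails before reading more than readLimit bytes and does not close the stream.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FallbackSketch {

    /** Hypothetical stand-in for a parser; this is not Tika's Parser interface. */
    public interface SimpleParser {
        String parse(InputStream in) throws IOException;
    }

    /**
     * Tries the primary parser; if it throws, rewinds the stream to where
     * it started and tries the fallback parser on the same bytes.
     */
    public static String parseWithFallback(InputStream raw, int readLimit,
                                           SimpleParser primary,
                                           SimpleParser fallback)
            throws IOException {
        // Wrap streams that cannot be rewound so mark/reset is available.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        in.mark(readLimit);
        try {
            return primary.parse(in);
        } catch (IOException | RuntimeException e) {
            // reset() itself throws IOException if the primary parser read
            // past readLimit on a buffered stream and invalidated the mark.
            in.reset();
            return fallback.parse(in);
        }
    }
}
```

The read limit is the main design trade-off here: a large limit makes the reset safe for big documents but keeps that many bytes buffered in memory, which is why spooling to a temporary file may be a better fit for large attachments.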

Just some thoughts

> Create configurable strategies for composite parsers
> ----------------------------------------------------
>
>                 Key: TIKA-1509
>                 URL: https://issues.apache.org/jira/browse/TIKA-1509
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Tim Allison
>            Priority: Major
>
> Several parsers can handle the same mime type, and we are currently ordering which parser is chosen (roughly) by the alphabetic order of the parser class name.
> Let's allow users to configure strategies for picking parsers.
> See and contribute to full discussion here: http://wiki.apache.org/tika/CompositeParserDiscussion



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)