Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/08/02 10:18:16 UTC

[jira] Commented: (TIKA-447) Container aware mimetype detection

    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894486#action_12894486 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

It would be great if the AutoDetectParser could automatically leverage detectors that rely on external parser libraries. The AutoDetectParser can't link directly to such parsers because of dependency issues, but we could use the service provider mechanism, just as we do with Parser classes, to automatically load all the Detectors available on the classpath. To do this effectively, I'd also add a Detector.getSupportedTypes() method like the one below, so that more complex and potentially more expensive detectors like POIFSContainerDetector (which may need to read the entire document) are called only if a more generic detector first determines that the input document matches the supported base type.

    /**
     * Returns the set of base media types supported by this detector
     * when used with the given parse context. The base media type can
     * be <code>application/octet-stream</code> for generic detectors
     * or a more specific type like <code>text/plain</code> or
     * <code>application/zip</code> for detectors that can only
     * distinguish between subtypes of that base type.
     *
     * @since Apache Tika 0.8
     * @param context parse context
     * @return immutable set of media types
     */
    Set<MediaType> getSupportedTypes(ParseContext context);
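
To illustrate how such gating might work, here is a minimal, self-contained sketch. It does not use the real Tika API: TypedDetector, MagicDetector, StubZipContainerDetector and the "application/x-hypothetical-zip-subtype" type are all made-up stand-ins, media types are plain strings, and the ParseContext argument is left out for brevity. The detector list is passed in explicitly here; in the proposal it would instead come from the service provider mechanism.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class GatedDetectionSketch {

    static final String OCTET_STREAM = "application/octet-stream";

    /** Hypothetical stand-in for a Detector that also declares its base types. */
    interface TypedDetector {
        Set<String> getSupportedTypes();
        String detect(byte[] data);
    }

    /** Cheap generic detector: looks only at the leading magic bytes. */
    static class MagicDetector implements TypedDetector {
        public Set<String> getSupportedTypes() {
            return Collections.singleton(OCTET_STREAM);
        }
        public String detect(byte[] data) {
            // "PK\003\004" is the local file header signature of a zip archive
            if (data.length >= 4 && data[0] == 'P' && data[1] == 'K'
                    && data[2] == 3 && data[3] == 4) {
                return "application/zip";
            }
            return OCTET_STREAM;
        }
    }

    /** Stub for an expensive container detector; counts how often it runs. */
    static class StubZipContainerDetector implements TypedDetector {
        int calls = 0;
        public Set<String> getSupportedTypes() {
            return Collections.singleton("application/zip");
        }
        public String detect(byte[] data) {
            calls++;
            // A real detector would open the archive and inspect its entries.
            return "application/x-hypothetical-zip-subtype";
        }
    }

    /**
     * Starts from the generic type and only invokes a detector when the
     * type found so far is one of its declared base types.
     */
    static String detect(List<TypedDetector> detectors, byte[] data) {
        String type = OCTET_STREAM;
        // Bounded number of passes so a misbehaving detector cannot loop forever
        for (int pass = 0; pass < detectors.size(); pass++) {
            String before = type;
            for (TypedDetector d : detectors) {
                if (d.getSupportedTypes().contains(type)) {
                    String result = d.detect(data);
                    if (!result.equals(type)) {
                        type = result;   // refined; start another pass
                        break;
                    }
                }
            }
            if (type.equals(before)) {
                break;                   // no detector could refine further
            }
        }
        return type;
    }

    public static void main(String[] args) {
        List<TypedDetector> detectors = Arrays.<TypedDetector>asList(
                new MagicDetector(), new StubZipContainerDetector());
        System.out.println(detect(detectors, new byte[] {'P', 'K', 3, 4}));
        System.out.println(detect(detectors, "plain text".getBytes()));
    }
}
```

The point of the sketch is that the expensive detector never sees input that the cheap magic-byte check has not already classified as a zip archive.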


> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to process container-based formats (e.g. zip files and OLE2 files) when trying to detect the correct mime type for a document.
> This needs to be configurable, because some people won't want Tika to do all the work of parsing the whole file when they're not interested in knowing exactly what's in it.
> Once we have gone to the trouble of opening and parsing the container file, we should try to keep the open container around to speed up parsing of the contents.
