Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/06/29 17:36:49 UTC

[jira] Created: (TIKA-447) Container aware mimetype detection

Container aware mimetype detection
----------------------------------

                 Key: TIKA-447
                 URL: https://issues.apache.org/jira/browse/TIKA-447
             Project: Tika
          Issue Type: New Feature
          Components: mime
    Affects Versions: 0.7
            Reporter: Nick Burch
         Attachments: TikaContainerDetection.patch

As discussed on the dev list, Tika should ideally have a configurable way to process container-based formats (e.g. Zip files and OLE2 files) when trying to detect the correct mime type for a document.

This needs to be configurable, because some people won't want Tika to do all the work of parsing the whole file when they're not interested in knowing exactly what's in it.

Once we have gone to the trouble of opening and parsing the container file, we should try to keep the open container around to speed up parsing of the contents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894510#action_12894510 ] 

Nick Burch commented on TIKA-447:
---------------------------------

Alex - have a look at the code, I think it already does what you're asking of it :)

For OLE2, when we detect the OLE2 signature, we load the file into POIFS. We then ask the detector what it is based on this.

For Zip, we look at each entry in the zip file in turn. If it's one we recognise the name of, and that tells us all we need, we return. Otherwise, we open up that entry, and grab the mime type from that, and return.
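
The Zip walk described above can be sketched roughly as follows. This is a minimal illustration using only java.util.zip, not the code from the patch: the class name ZipEntrySniffer, the two entry names it recognises ("mimetype" and "META-INF/MANIFEST.MF") and the returned type strings are assumptions chosen for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika.
public class ZipEntrySniffer {

    /** Walks the entries in order and returns the first type a known entry reveals. */
    public static String detect(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            if (name.equals("META-INF/MANIFEST.MF")) {
                // The entry name alone tells us all we need.
                return "application/java-archive";
            }
            if (name.equals("mimetype")) {
                // ODF-style containers: open the entry and take the type from its content.
                return readEntry(zip);
            }
        }
        // Nothing recognised: report the generic container type.
        return "application/zip";
    }

    // Reads the current entry to its end; ZipInputStream returns -1 at the entry boundary.
    private static String readEntry(InputStream entryStream) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        int n;
        while ((n = entryStream.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII).trim();
    }
}
```

Note that this sketch consumes the stream it is given, so a caller that wants to parse afterwards would need to hand it a copy or a re-openable source.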



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Alex Ott (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894511#action_12894511 ] 

Alex Ott commented on TIKA-447:
-------------------------------

Ah, sorry Nick - I hadn't looked at the code yet. I thought that we stopped at the container type once it matched some signature.



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893260#action_12893260 ] 

Chris A. Mattmann commented on TIKA-447:
----------------------------------------

Nick, awesome!



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895192#action_12895192 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

I committed my patch in revision 982175.

> memory and processing impact of opening the container

I think this is acceptable, as the extra cost is only associated with specific media types, and we can use the open container feature you added to TikaInputStream to allow later parsing stages to avoid duplicating these costs. Also, since this functionality is now only triggered when the detector is passed a TikaInputStream, a performance-conscious user can easily prevent the extra processing. We might also want to add some extra flag for this if needed.

> detectors run in the right order

This was a part of my thinking behind the proposed getSupportedTypes() method. With that we could choose to only run these kinds of more complex detectors when simpler detectors have first identified the basic container format.



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894509#action_12894509 ] 

Nick Burch commented on TIKA-447:
---------------------------------

Jukka - that might end up being more work though? Also, short of refactoring the current mime types to split out all the different bits, I'm not sure we will have that many new detectors ever?



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894486#action_12894486 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

It would be great if the AutoDetectParser could automatically leverage detectors that use external parser libraries. The AutoDetectParser can't directly link to such parsers due to dependency issues, but we could use the service provider mechanism, just like we do with Parser classes, to automatically load all the Detectors available in the classpath. To do this effectively, I'd also add a Detector.getSupportedTypes() method like the one below. That way more complex and potentially more expensive detectors like POIFSContainerDetector, which may need to read the entire document, would only be called once a more generic detector has determined that the input document matches the supported base type.

    /**
     * Returns the set of base media types supported by this detector
     * when used with the given parse context. The base media type can
     * be <code>application/octet-stream</code> for generic detectors
     * or a more specific type like <code>text/plain</code> or
     * <code>application/zip</code> for detectors that can only
     * distinguish between subtypes of that base type.
     *
     * @since Apache Tika 0.8
     * @param context parse context
     * @return immutable set of media types
     */
    Set<MediaType> getSupportedTypes(ParseContext context);
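
The service provider loading mentioned above could look roughly like the following. This is only a sketch built on java.util.ServiceLoader: SimpleDetector is a hypothetical stand-in for Tika's Detector interface, reduced to strings so the example needs no Tika classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical loader, not part of Tika.
public class DetectorLoader {

    /** Hypothetical stand-in for a detector: maps leading bytes to a media type name. */
    public interface SimpleDetector {
        String detect(byte[] header);
    }

    /** Collects every SimpleDetector implementation registered on the classpath. */
    public static List<SimpleDetector> loadAll() {
        List<SimpleDetector> detectors = new ArrayList<>();
        // ServiceLoader reads provider class names from META-INF/services files.
        for (SimpleDetector detector : ServiceLoader.load(SimpleDetector.class)) {
            detectors.add(detector);
        }
        return detectors;
    }
}
```

A jar that wants to contribute a detector would list its implementation class in a META-INF/services file named after the interface's binary name (here DetectorLoader$SimpleDetector); with no such file on the classpath, loadAll() simply returns an empty list.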




[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894520#action_12894520 ] 

Nick Burch commented on TIKA-447:
---------------------------------

Using the container aware detector will generally give a more accurate answer, but at the cost of more memory use and longer processing time. (Oh, plus the need for various parser dependencies.)

There was some reluctance on-list about making this the default, due to the memory and processing impact of opening the container, which we'll need to take notice of. 

There's also the issue of making sure the detectors run in the right order, which may matter for some but not for others. Alas I don't have a good answer for the way to handle all these different needs...



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893182#action_12893182 ] 

Nick Burch commented on TIKA-447:
---------------------------------

As no-one has objected, I've committed this initial code in r980058.

With this commit, OLE2 based detection should be complete, and some Zip based detection is there, but some still remains to be added.



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894502#action_12894502 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

Hmm, I guess you're right, perhaps we won't need such multi-level detector functionality. The alternative is to simply load all available Detectors, run them on the input document and finally select the most specific of the returned media types.
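
One way to read "select the most specific of the returned media types" is sketched below. It is an assumption-laden illustration, not Tika code: the class MostSpecificType and its small child-to-parent table are hypothetical, and specificity is taken to mean depth below application/octet-stream in that table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika.
public class MostSpecificType {

    // Hypothetical child -> parent relation between media types.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/zip", "application/octet-stream");
        PARENT.put("application/java-archive", "application/zip");
        PARENT.put("application/vnd.oasis.opendocument.text", "application/zip");
        PARENT.put("application/x-tika-msoffice", "application/octet-stream");
        PARENT.put("application/msword", "application/x-tika-msoffice");
    }

    /** Number of ancestors a type has in the table; unknown types count as depth 0. */
    static int depth(String type) {
        int depth = 0;
        for (String t = PARENT.get(type); t != null; t = PARENT.get(t)) {
            depth++;
        }
        return depth;
    }

    /** Returns the deepest candidate; on a tie the earlier candidate wins. */
    public static String select(List<String> candidates) {
        String best = "application/octet-stream";
        for (String candidate : candidates) {
            if (depth(candidate) > depth(best)) {
                best = candidate;
            }
        }
        return best;
    }
}
```

With this approach every detector runs on every document, so the cost concern raised elsewhere in this thread still applies.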



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Alex Ott (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894501#action_12894501 ] 

Alex Ott commented on TIKA-447:
-------------------------------

To Nick: will this allow us to implement support for self-extracting archives? If we implement that as a separate checker, then we'll need to implement archive extraction/detection inside that checker, which could lead to code duplication.



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894518#action_12894518 ] 

Jukka Zitting commented on TIKA-447:
------------------------------------

It's a bit more work, yes. What I'm trying to achieve here is for someone who just uses "new Tika().detect(...)" to be able to benefit from these extra detectors when they're available in the classpath.



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893620#action_12893620 ] 

Chris A. Mattmann commented on TIKA-447:
----------------------------------------

Nick, awesome job! Comments below:

{quote}
I think the only bit left for now is to document it. We don't currently have a Detection section in the documentation. Shall I create a new one, put in the basics from one of the ApacheCon Tika talks, then add a section on container aware detection?
{quote}

Yep, I would do this. I would just add some APT documentation and create a section called "Detection", with some useful information on there. From that APT page you could also link to the Wiki page where the discussion on container metadata took place:

http://wiki.apache.org/tika/MetadataDiscussion

Cheers,
Chris





[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893610#action_12893610 ] 

Nick Burch commented on TIKA-447:
---------------------------------

I've added support for OOXML files (detection + container re-use), as well as Jar files.

I believe the only Zip-based container format we can't currently detect with this is iWork. I've figured out how to tell it's an iWork document, but not how to tell which iWork document subtype it is.

I think the only bit left for now is to document it. We don't currently have a Detection section in the documentation. Shall I create a new one, put in the basics from one of the ApacheCon Tika talks, then add a section on container aware detection?



[jira] Updated: (TIKA-447) Container aware mimetype detection

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-447:
-------------------------------

    Attachment: TIKA-447-TikaInputStream.patch

BTW, the current new Detector implementations are a bit troublesome as they break the contract that the detect() method must not close() the given stream and should use mark() and reset() where necessary to avoid changing the state of the stream. The rationale behind this contract is that you should be able to call parse() on the same stream instance after detecting its type.

The attached patch fixes this issue by using the TikaInputStream.getFile() method to access the underlying file (when available or spooled) when detecting these kinds of complex container formats. If the given stream is not a TikaInputStream, then just the generic application/zip or application/x-tika-msoffice type is returned.
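
The stream contract described above (no close(), and mark()/reset() around any reads) can be illustrated with plain java.io. This is a generic sketch rather than the patch itself; the class and method names are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
public class PeekingDetector {

    /** Reads up to {@code limit} leading bytes, then rewinds the stream. Never closes it. */
    public static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] buffer = new byte[limit];
            int total = 0;
            while (total < limit) {
                int read = in.read(buffer, total, limit - total);
                if (read < 0) {
                    break; // stream is shorter than the requested prefix
                }
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            // Rewind so a later parse() on the same stream sees it from the start.
            in.reset();
        }
    }
}
```

Callers with a stream that does not support mark/reset can wrap it in a java.io.BufferedInputStream first. Formats that need random access to the whole container cannot be handled by peeking at a prefix, which is presumably why the patch goes through the underlying file instead.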



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894494#action_12894494 ] 

Nick Burch commented on TIKA-447:
---------------------------------

At the moment, the ContainerAwareDetector checks the first 8 bytes of the file. If they match the OLE2 header signature, it hands it off to POIFS. If the first 4 bytes match the zip header signature, it does zip checking. If neither of them match, it falls back to the default detector.

To me, this seems simpler!
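
The signature check described above can be sketched as below. The byte values are the standard OLE2 and Zip local-file-header magic numbers; the class ContainerSignatures and its Kind enum are hypothetical names for this example, and the hand-off to POIFS or to the zip checks is left out.

```java
// Hypothetical helper, not part of Tika.
public class ContainerSignatures {

    // OLE2 compound document header: D0 CF 11 E0 A1 B1 1A E1
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Zip local file header: "PK" 03 04
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 };

    public enum Kind { OLE2, ZIP, OTHER }

    /** Classifies a file by its leading bytes; OTHER means "fall back to the default detector". */
    public static Kind classify(byte[] header) {
        if (startsWith(header, OLE2)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP)) {
            return Kind.ZIP;
        }
        return Kind.OTHER;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```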



[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Alex Ott (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894507#action_12894507 ] 

Alex Ott commented on TIKA-447:
-------------------------------

It would be better to have a flag that says "stop if this rule matched", because applying all of the rules could lead to poor performance.
It would be better to have something like this, for example for zips:
 - rule for jar: zip-type == X1
 - rule for odf: zip-type == X2
.....

zip-type would be calculated once on first invocation and then re-used. None of the subtype rules (for jar, odf, etc.) would have the "stop here" flag, while the rule for ordinary zips would have it, so we'd stop after checking all of the subtypes.
The same could be implemented for OLE2 and other container formats, like OGG, etc.
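
One possible reading of this rule scheme is sketched below. Everything in it is an assumption made for illustration: the StopFlagRules class, the Rule fields, treating a null zip-type as "any zip", and taking the first matching rule as the result.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical rule engine, not part of Tika.
public class StopFlagRules {

    /** One detection rule; a null zipType means "any zip" (the generic rule). */
    public static final class Rule {
        final String mediaType;
        final String zipType;
        final boolean stop;

        public Rule(String mediaType, String zipType, boolean stop) {
            this.mediaType = mediaType;
            this.zipType = zipType;
            this.stop = stop;
        }
    }

    /**
     * Applies the rules in order. The zip-type is computed at most once, on
     * the first rule that needs it, and evaluation ends at the first matching
     * rule that carries the stop flag.
     */
    public static String detect(List<Rule> rules, Supplier<String> zipTypeSource) {
        String zipType = null;
        boolean computed = false;
        String result = null;
        for (Rule rule : rules) {
            boolean matched;
            if (rule.zipType == null) {
                matched = true;
            } else {
                if (!computed) {
                    zipType = zipTypeSource.get(); // the expensive step, done once
                    computed = true;
                }
                matched = rule.zipType.equals(zipType);
            }
            if (matched && result == null) {
                result = rule.mediaType; // first match wins
            }
            if (matched && rule.stop) {
                break;
            }
        }
        return result;
    }
}
```

In this sketch the subtype rules are listed first without the stop flag, and the generic zip rule comes last with it, so nothing after the zip rule is ever evaluated.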




[jira] Commented: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898281#action_12898281 ] 

Nick Burch commented on TIKA-447:
---------------------------------

I've added some Detector documentation in r985242, please everyone dive in with bits I have missed!



[jira] Updated: (TIKA-447) Container aware mimetype detection

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-447:
----------------------------

    Attachment: TikaContainerDetection.patch

Patch which implements limited OLE2 and ODF detection by parsing the containers. May not be the best way to do it however...
