Posted to user@tika.apache.org by Doug <ds...@gmail.com> on 2012/06/18 22:55:55 UTC

Detection behavior for mime-types without criteria

I'm planning to use TIKA as part of a process for cataloging data on a
share drive. Based on the website and tika-mimetypes.xml, the type
detection looks pretty comprehensive. However, while browsing
tika-mimetypes.xml, I noticed that about half of the mime-types listed have
no associated glob, root-XML, or magic elements. Without these match
criteria, can TIKA ever actually detect a file of one of these types?

I browsed the detector source. It looks like it tries to match against
magic, then XML, then names/globs/patterns. If a mime-type doesn't have any
of these, can TIKA do anything with it? If so, why is it listed in the
tika-mimetypes.xml file?

Thank you

Doug

Re: Detection behavior for mime-types without criteria

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 22 Jun 2012, Doug wrote:
> Are the mime-type patterns ANDed or ORed?

Depends how you write them!

This is an and:
       <match value="7z" type="string" offset="0:1" >
         <match value="0xBCAF271C" type="string" offset="2:5" />
       </match>

This is an or:
       <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
       <match value="MSWordDoc" type="string" offset="2112"/>
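
A rough way to picture the difference, as a toy Python model and not Tika's actual matcher: a nested match must succeed together with its parent (AND), while sibling matches are alternatives (OR). The names `Match`, `hits`, and `matches` are hypothetical, and the sketch assumes fixed offsets and literal byte values only; it ignores offset ranges, masks, and the other value types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Match:
    value: bytes                 # literal bytes expected at `offset`
    offset: int                  # fixed start offset (ranges not modelled)
    children: List["Match"] = field(default_factory=list)

    def hits(self, data: bytes) -> bool:
        # The clause itself must be present at its offset...
        if data[self.offset:self.offset + len(self.value)] != self.value:
            return False
        # ...AND, if it has nested clauses, at least one of them must hit too.
        return not self.children or any(c.hits(data) for c in self.children)


def matches(siblings: List[Match], data: bytes) -> bool:
    # Sibling clauses are alternatives: any one of them is enough (OR).
    return any(m.hits(data) for m in siblings)


# Nested: "7z" at 0 AND the four signature bytes at 2.
seven_zip = [Match(b"7z", 0, [Match(bytes.fromhex("BCAF271C"), 2)])]

# Siblings: either string is enough.
word = [
    Match(b"Documento Microsoft Word 6", 2080),
    Match(b"MSWordDoc", 2112),
]
```

With this model, a buffer starting with "7z" but lacking the following signature bytes does not match `seven_zip`, whereas a buffer containing only "MSWordDoc" at offset 2112 does match `word`.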

> If I have a glob and a magic pattern, does it require both in order to 
> match on the type? Will one or the other work and which takes 
> precedence? If I list a glob pattern first and it does not match (i.e. 
> mis-labeled file), will it still check for the magic?

It's a little complicated. If the magic matches, then that'll be used, but 
the glob can specialise it to a subtype. If no magic matches, then only the
glob is used.

Taking these fake examples:
   application/test    *.test bytes1-4=TEST
   application/test2 extends /test   *.test2
   application/test3   *.test3

TEST called foo.test -> application/test
TEST called foo.test2 -> application/test2
TEST called foo.test3 -> application/test (test3 is not a subtype of test)
RANDOM called foo.test3 -> application/test3 (no magic, only glob)
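
The four cases above can be reproduced with a small toy model. This is only a sketch of the rule as described in this message, not Tika's code: the `TYPES` table, `is_specialisation`, and `detect` are hypothetical names, magic is simplified to a prefix check, and the `application/octet-stream` fallback is an assumption for the case where nothing matches.

```python
from fnmatch import fnmatch

# Hypothetical registry mirroring the fake examples above.
TYPES = {
    "application/test":  {"glob": "*.test",  "magic": b"TEST", "parent": None},
    "application/test2": {"glob": "*.test2", "magic": None,
                          "parent": "application/test"},
    "application/test3": {"glob": "*.test3", "magic": None, "parent": None},
}


def is_specialisation(child, ancestor):
    # Walk up the parent chain; a type counts as a specialisation of itself.
    while child is not None:
        if child == ancestor:
            return True
        child = TYPES[child]["parent"]
    return False


def detect(data, name):
    by_magic = next((t for t, d in TYPES.items()
                     if d["magic"] and data.startswith(d["magic"])), None)
    by_glob = next((t for t, d in TYPES.items()
                    if fnmatch(name, d["glob"])), None)
    if by_magic is None:
        # No magic hit: only the glob is used (fallback is an assumption).
        return by_glob or "application/octet-stream"
    if by_glob is not None and is_specialisation(by_glob, by_magic):
        # The glob may refine the magic result, but only within its hierarchy.
        return by_glob
    return by_magic
```

For instance, `detect(b"TEST...", "foo.test3")` stays at `application/test` because `application/test3` is outside that hierarchy, while `detect(b"RANDOM", "foo.test3")` falls back to the glob.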

Nick

Re: Detection behavior for mime-types without criteria

Posted by Doug <ds...@gmail.com>.
Are the mime-type patterns ANDed or ORed? If I have a glob and a magic
pattern, does it require both in order to match on the type? Will one or
the other work and which takes precedence? If I list a glob pattern first
and it does not match (i.e. mis-labeled file), will it still check for the
magic?

Based on information at

http://library.gnome.org/admin/system-admin-guide/stable/mimetypes-source-xml.html.en
and
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.18.html#id2653327

it looks like the mimetypes.xml file is intended to be used as follows:
glob first. If the globs match zero types or more than one type, then
try a magic match (if one is provided). If neither glob nor magic matches
(or none is provided), default to text/plain or application/octet-stream.

What about the case where there is a single glob match, but it was
sloppily applied and magic would have correctly typed the file? Can TIKA
save itself from making an error in this case? I think I would prefer to
see it as magic first, then glob. I've already got the files in memory
anyway so seeking is not a problem.
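
The glob-first order described above, and the mislabeled-file case it gets wrong, can be sketched as follows. This is a toy model of the reading given in this message, not the specification's or Tika's implementation: the `RULES` table and `detect_glob_first` are hypothetical, magic is reduced to a prefix check, and the default is simplified to `application/octet-stream` only.

```python
from fnmatch import fnmatch

# Hypothetical rules: (mime type, glob, magic prefix).
RULES = [
    ("application/pdf", "*.pdf", b"%PDF-"),
    ("image/png", "*.png", b"\x89PNG"),
]


def detect_glob_first(data, name):
    globbed = [t for t, glob, _ in RULES if fnmatch(name, glob)]
    if len(globbed) == 1:
        # Exactly one glob hit: it wins and magic is never consulted.
        return globbed[0]
    # Zero or several glob hits: fall back to magic.
    for mime_type, _, magic in RULES:
        if data.startswith(magic):
            return mime_type
    # Neither glob nor magic matched (simplified default).
    return "application/octet-stream"
```

Under this order, PNG bytes in a file named report.pdf come out as `application/pdf`, which is exactly the sloppy-glob error in question; a magic-first order would report `image/png` instead.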

Any additional insights are welcome.

Thanks

Doug



Re: Detection behavior for mime-types without criteria

Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 18 Jun 2012, Doug wrote:
> I'm planning to use TIKA as part of a process for cataloging data on a 
> share drive. Based on the website and tika-mimetypes.xml, the type 
> detection looks pretty comprehensive. However, while browsing 
> tika-mimetypes.xml, I noticed that about half of the mime-types listed 
> have no associated glob, root-XML, or magic elements. Without these match 
> criteria, can TIKA ever actually detect a file of one of these types?

For a type to be detected, Tika needs something to go on. That could be a
glob, an XML root element, some magic, or even a combination of all of them.

> I browsed the detector source. It looks like it tries to match against
> magic, then XML, then names/globs/patterns. If a mime-type doesn't have any
> of these, can TIKA do anything with it? If so, why is it listed in the
> tika-mimetypes.xml file?

The tika-mimetypes.xml file is used for both detection and information. 
With those entries, we can tell you something about the mimetype, even if 
we can't always detect it.
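
The split between detection and information can be pictured with a toy registry. Everything here is hypothetical (the `REGISTRY` entries, `describe`, and `detect_by_name` are made up for illustration, and only globs are modelled); it is not how Tika stores its types.

```python
from fnmatch import fnmatch

# Hypothetical entries; the second has no match criteria at all.
REGISTRY = {
    "application/x-example": {"comment": "Example format", "globs": ["*.ex"]},
    "application/x-info-only": {"comment": "Described but undetectable",
                                "globs": []},
}


def describe(mime_type):
    # Information lookup works for every listed type.
    return REGISTRY[mime_type]["comment"]


def detect_by_name(name):
    # Detection can only ever return types that carry some match criteria.
    for mime_type, entry in REGISTRY.items():
        if any(fnmatch(name, glob) for glob in entry["globs"]):
            return mime_type
    return "application/octet-stream"
```

Here `describe("application/x-info-only")` still returns a description, even though no file name can ever make `detect_by_name` return that type.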

Nick