Posted to user@tika.apache.org by Doug <ds...@gmail.com> on 2012/06/18 22:55:55 UTC
Detection behavior for mime-types without criteria
I'm planning to use TIKA as part of a process for cataloging data on a
share drive. Based on the website and tika-mimetypes.xml, the type
detection looks pretty comprehensive. However, while browsing
tika-mimetypes.xml, I noticed that about half of the mime-types listed have
no associated glob, root-XML, or magic elements. Without these match
criteria, can TIKA ever actually detect a file of one of these types?
I browsed the detector source. It looks like it tries to match against
magic, then XML, then names/globs/patterns. If a mime-type doesn't have any
of these, can TIKA do anything with it? If so, why is it listed in the
tika-mimetypes.xml file?
Thank you
Doug
Re: Detection behavior for mime-types without criteria
Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 22 Jun 2012, Doug wrote:
> Are the mime-type patterns ANDed or ORed?
Depends how you write them!
This is an and:
<match value="7z" type="string" offset="0:1">
  <match value="0xBCAF271C" type="string" offset="2:5" />
</match>
This is an or:
<match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
<match value="MSWordDoc" type="string" offset="2112"/>
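To make the nesting rule concrete, here is a minimal Java sketch of how such clauses could be evaluated. It is a toy model, not Tika's implementation: the class and method names are hypothetical, offsets are fixed positions rather than ranges like "0:1", and it assumes that several nested clauses under one parent are ORed among themselves, in the same way as top-level siblings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from Tika itself.
final class MagicMatchSketch {

    /** A hypothetical magic clause: expected bytes at a fixed offset, plus nested clauses. */
    static final class Clause {
        final int offset;
        final byte[] expected;
        final List<Clause> nested;

        Clause(int offset, byte[] expected, Clause... nested) {
            this.offset = offset;
            this.expected = expected;
            this.nested = Arrays.asList(nested);
        }

        boolean matches(byte[] data) {
            if (offset + expected.length > data.length) {
                return false;
            }
            for (int i = 0; i < expected.length; i++) {
                if (data[offset + i] != expected[i]) {
                    return false;
                }
            }
            // Nesting is the AND: this clause matched, and a nested clause must match too.
            return nested.isEmpty() || anyMatches(nested, data);
        }
    }

    /** Sibling clauses are the OR: the first one that matches is enough. */
    static boolean anyMatches(List<Clause> siblings, byte[] data) {
        for (Clause clause : siblings) {
            if (clause.matches(data)) {
                return true;
            }
        }
        return false;
    }

    /** Helper for building ASCII patterns. */
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }
}
```

In this model, a "7z" clause at offset 0 with a nested four-byte clause at offset 2 only matches data that satisfies both, while a list of two unrelated clauses matches as soon as either one does.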
> If I have a glob and a magic pattern, does it require both in order to
> match on the type? Will one or the other work and which takes
> precedence? If I list a glob pattern first and it does not match (i.e.
> mis-labeled file), will it still check for the magic?
It's a little complicated. If the magic matches, then that'll be used, but
the glob can specialise it to a subtype of the magic-detected type. If no
magic matches, then only the glob is used.
Taking these fake examples:
application/test                   *.test    bytes1-4=TEST
application/test2  extends /test   *.test2
application/test3                  *.test3

TEST called foo.test     -> application/test
TEST called foo.test2    -> application/test2
TEST called foo.test3    -> application/test (test3 is wrong hierarchy)
RANDOM called foo.test3  -> application/test3 (no magic, only glob)
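The four outcomes above can be reproduced with a small Java sketch of the "magic first, glob may specialise" rule. This is only an illustration built from the fake types in this message: the class, fields, and methods are hypothetical and are not Tika's API, globs are reduced to file-name suffixes, magic is reduced to a leading byte string, and the application/octet-stream fallback for "nothing matched" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Toy model of the rule described above; not Tika's actual detector.
final class MagicThenGlobSketch {

    /** A hypothetical registry entry. */
    static final class Type {
        final String name;
        final Type parent;      // null when the type extends nothing
        final String extension; // stands in for the glob, e.g. ".test" for *.test
        final String magic;     // expected leading bytes, or null when there is no magic

        Type(String name, Type parent, String extension, String magic) {
            this.name = name;
            this.parent = parent;
            this.extension = extension;
            this.magic = magic;
        }
    }

    static final Type TEST = new Type("application/test", null, ".test", "TEST");
    static final Type TEST2 = new Type("application/test2", TEST, ".test2", null);
    static final Type TEST3 = new Type("application/test3", null, ".test3", null);
    static final List<Type> TYPES = Arrays.asList(TEST, TEST2, TEST3);

    /** True when candidate is base itself or extends it, directly or indirectly. */
    static boolean specialises(Type candidate, Type base) {
        for (Type t = candidate; t != null; t = t.parent) {
            if (t == base) {
                return true;
            }
        }
        return false;
    }

    static String detect(byte[] data, String fileName) {
        String head = new String(data, StandardCharsets.ISO_8859_1);
        Type byMagic = null;
        Type byGlob = null;
        for (Type t : TYPES) {
            if (byMagic == null && t.magic != null && head.startsWith(t.magic)) {
                byMagic = t;
            }
            if (byGlob == null && fileName.endsWith(t.extension)) {
                byGlob = t;
            }
        }
        if (byMagic == null) {
            // No magic matched: only the glob is used (fallback type is an assumption).
            return byGlob != null ? byGlob.name : "application/octet-stream";
        }
        // Magic matched: the glob is honoured only if it names the same type or a subtype.
        if (byGlob != null && specialises(byGlob, byMagic)) {
            return byGlob.name;
        }
        return byMagic.name;
    }
}
```

Calling detect with data that starts with TEST and the name foo.test3 returns application/test, because application/test3 does not extend the magic-detected type, which matches the "wrong hierarchy" case.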
Nick
Re: Detection behavior for mime-types without criteria
Posted by Doug <ds...@gmail.com>.
Are the mime-type patterns ANDed or ORed? If I have a glob and a magic
pattern, does it require both in order to match on the type? Will one or
the other work and which takes precedence? If I list a glob pattern first
and it does not match (i.e. mis-labeled file), will it still check for the
magic?
Based on information at
http://library.gnome.org/admin/system-admin-guide/stable/mimetypes-source-xml.html.en
and
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.18.html#id2653327
it looks like the mimetypes.xml file is intended to be used as follows:
try the glob first. If zero or more than one mime-type matches, then try a
magic match (if one is provided). If neither glob nor magic matches (or
none is provided), default to text/plain or application/octet-stream.
What about the case where there is a single glob match, but it was
sloppily applied and magic would have correctly typed the file? Can TIKA
save itself from making an error in this case? I think I would prefer to
see it as magic first, then glob. I've already got the files in memory
anyway, so seeking is not a problem.
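The concern can be shown with a short Java sketch of the glob-first order as read above. It models only that reading, not the freedesktop specification or Tika: the registry contents and all names are hypothetical, globs are reduced to file-name suffixes, magic to a leading byte string, and application/octet-stream is used as the only default.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a glob-first detector; every name here is illustrative.
final class GlobFirstSketch {

    // Hypothetical registry: file-name suffix -> type, and leading bytes -> type.
    static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static final Map<String, String> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".test", "application/test");
        GLOBS.put(".test3", "application/test3");
        MAGIC.put("TEST", "application/test");
    }

    static String detectGlobFirst(byte[] data, String fileName) {
        List<String> globHits = new ArrayList<>();
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                globHits.add(e.getValue());
            }
        }
        if (globHits.size() == 1) {
            // A single glob match wins outright; the content is never inspected.
            return globHits.get(0);
        }
        // Zero or several glob matches: fall back to magic.
        String head = new String(data, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

With this ordering, content that starts with TEST but is named foo.test3 is reported as application/test3, because the single glob match short-circuits the magic check. That is the mislabelled-file error the question is about.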
Any additional insights are welcome.
Thanks
Doug
On Mon, Jun 18, 2012 at 5:11 PM, Nick Burch <ni...@alfresco.com> wrote:
> On Mon, 18 Jun 2012, Doug wrote:
>
>> I'm planning to use TIKA as part of a process for cataloging data on a
>> share drive. Based on the website and tika-mimetypes.xml, the type
>> detection looks pretty comprehensive. However, while browsing
>> tika-mimetypes.xml, I noticed that about half of the mime-types listed have
>> no associated glob, root-XML, or magic elements. Without these match
>> criteria, can TIKA ever actually detect a file of one of these types?
>>
>
> To be detected, Tika will need something to go on. That could be a glob, an
> XML root element, some magic, or even a combination of all of them.
>
>
>> I browsed the detector source. It looks like it tries to match against
>> magic, then XML, then names/globs/patterns. If a mime-type doesn't have any
>> of these, can TIKA do anything with it? If so, why is it listed in the
>> tika-mimetypes.xml file?
>>
>
> The tika-mimetypes.xml file is used for both detection and information.
> With those entries, we can tell you something about the mimetype, even if
> we can't always detect it.
>
> Nick
>
Re: Detection behavior for mime-types without criteria
Posted by Nick Burch <ni...@alfresco.com>.
On Mon, 18 Jun 2012, Doug wrote:
> I'm planning to use TIKA as part of a process for cataloging data on a
> share drive. Based on the website and tika-mimetypes.xml, the type
> detection looks pretty comprehensive. However, while browsing
> tika-mimetypes.xml, I noticed that about half of the mime-types listed
> have no associated glob, root-XML, or magic elements. Without these match
> criteria, can TIKA ever actually detect a file of one of these types?
To be detected, Tika will need something to go on. That could be a glob,
an XML root element, some magic, or even a combination of all of them.
> I browsed the detector source. It looks like it tries to match against
> magic, then XML, then names/globs/patterns. If a mime-type doesn't have any
> of these, can TIKA do anything with it? If so, why is it listed in the
> tika-mimetypes.xml file?
The tika-mimetypes.xml file is used for both detection and information.
With those entries, we can tell you something about the mimetype, even if
we can't always detect it.
Nick