
Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2007/10/18 17:27:44 UTC

Mime type detection (Was: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.)

Hi,

On 10/18/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> If I understand correctly, we already have what we need in MimeUtils:
>     public String getType(String typeName, String url, byte[] data) { ... }

The current MimeUtils.getType relies only on magic header matching,
and should be fixed.

The main reason why I decided to implement my own version of the code
based on MimeTypes in AutoDetectParser was that I was somewhat
confused about the separation of concerns across MimeTypes and
MimeUtils. The MimeTypes class already has a number of utility methods
like getMimeType(String, byte[]) and getMimeType(URL), so I'm not sure
why we need MimeUtils.

> Jukka, should I modify AutoDetectParser to call this method instead of its
> own?

OK once the method has been fixed.

> However, the bigger issue is, is the assessment that header based detection
> fails with certain file types correct?

Magic detection can never be 100% correct or complete, but there's a
lot that we could still do to improve the current status in Tika.

BR,

Jukka Zitting

Re: Mime type detection (Was: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.)

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.

Jukka & All -

It looks like the current getType() relies on magic header matching alone
only when the header actually yields a type.  Assuming the header lookup
returns null, and not the DEFAULT type, when it cannot recognize the header,
I think this is how it works:

If a type can be determined from the byte[] header, it is used.

Else, if a type can be determined from the type hint parameter, and that
type is consistent with the URL, it is used.

Else, if a type can be determined from the URL, it is used.

Is this the correct logic?
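
For illustration, the fallback order described above can be sketched in Java
roughly as follows. This is not the MimeUtils code: the class
DetectionOrderSketch, its helper methods and the tiny lookup table are
hypothetical, and "consistent with the URL" is assumed to mean that the URL
either yields no type or yields the same type as the hint. It uses Map.of,
so it needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

/** Sketch of the fallback order only; every name here is hypothetical, not Tika's API. */
final class DetectionOrderSketch {

    // Tiny illustrative table; a real repository would load this from configuration.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "html", "text/html",
            "txt", "text/plain");

    /** Returns a type from the magic header, or null if the header is not recognized. */
    static String fromMagic(byte[] data) {
        if (data != null && data.length >= 4
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F') {
            return "application/pdf";
        }
        return null;
    }

    /** Returns a type from the extension after the last dot in the URL, or null if unknown. */
    static String fromUrl(String url) {
        if (url == null) {
            return null;
        }
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) {
            return null;
        }
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Header first, then a hint that the URL does not contradict, then the URL. */
    static String detect(String typeHint, String url, byte[] data) {
        String magicType = fromMagic(data);
        if (magicType != null) {
            return magicType;
        }
        String urlType = fromUrl(url);
        // Assumption: the hint is "consistent" if the URL gives no type or the same type.
        if (typeHint != null && (urlType == null || urlType.equals(typeHint))) {
            return typeHint;
        }
        return urlType; // may be null when nothing matched
    }

    public static void main(String[] args) {
        System.out.println(detect("text/html", "http://example.com/a.txt", null));
    }
}
```

In this sketch a hint that contradicts the URL is ignored in favour of the
URL, and the method returns null when none of the three sources yields a type.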

I've modified the documentation and some conditionals in the method so that
it is (IMHO) a little clearer.  I've attached a patch and a .txt file with
the method intact.  (Shall I commit this?)

MimeUtils.patch:
http://www.nabble.com/file/p13278818/MimeUtils.patch
MimeUtils.getMimeType.txt:
http://www.nabble.com/file/p13278818/MimeUtils.getMimeType.txt

- Keith




Re: Mime type detection (Was: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.)

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Jukka,

Great, I'm glad that you broke this out into a separate thread, as I was
just about to respond to Keith's prior message.

> The current MimeUtils.getType relies only on magic header matching,
> and should be fixed.

+1. As I suggested, we should probably add something there similar to the
changes that I just committed to the Nutch Content class.

> 
> The main reason why I decided to implement my own version of the code
> based on MimeTypes in AutoDetectParser was that I was somewhat
> confused about the separation of concerns across MimeTypes and
> MimeUtils. The MimeTypes class already has a number of utility methods
> like getMimeType(String, byte[]) and getMimeType(URL), so I'm not sure
> why we need MimeUtils.

Good question. I was uncertain about that myself at first, since the code
that I got from Jerome already had it. After looking through the code and
trying to understand it better (when I was originally committing it), I
decided that it makes sense to have MimeUtils as a decorator class that
handles instantiation of the MimeTypes repository from a resourceName and a
boolean mime magic flag. AFAIK, that's currently the only need for it.
It may make sense to simply move this capability down into MimeTypes, use
that class directly, and remove MimeUtils altogether. If this is your
suggestion, then I'm +1 for it.

> 
>> Jukka, should I modify AutoDetectParser to call this method instead of its
>> own?
> 
> OK once the method has been fixed.

Well, more generally, once we all agree on what to do :)

> 
>> However, the bigger issue is, is the assessment that header based detection
>> fails with certain file types correct?
> 
> Magic detection can never be 100% correct or complete, but there's a
> lot that we could still do to improve the current status in Tika.

+1 for this. Mime detection/magic header/byte detection is not exactly a
science, but more a practice of heuristics and patterns picked up over time.
I think that the framework in Tika is one of the most comprehensive I've
seen. Additionally, the great thing about it is that it's extensible. If we
decide it's not doing a great job at detecting Keith's .ppt files, we can
add more byte headers to compare against by editing the tika-mimetypes.xml
file underneath the application/microsoft-powerpoint mime type.
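
As a rough illustration of what comparing byte headers means, here is a
minimal Java sketch of a magic prefix check. The class MagicPrefixSketch and
the method matchesAt are hypothetical and are not Tika's API or the format of
tika-mimetypes.xml. The byte sequence is the OLE2 compound document signature
used by legacy .ppt files.

```java
import java.util.Arrays;

/** Sketch of a magic header comparison; names are hypothetical, not Tika's API. */
final class MagicPrefixSketch {

    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the bytes of data starting at offset equal magic. */
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data == null || offset < 0 || data.length - offset < magic.length) {
            return false; // not enough bytes to hold the signature
        }
        byte[] window = Arrays.copyOfRange(data, offset, offset + magic.length);
        return Arrays.equals(window, magic);
    }

    public static void main(String[] args) {
        byte[] header = Arrays.copyOf(OLE2_MAGIC, 16); // signature followed by zero padding
        System.out.println(matchesAt(header, 0, OLE2_MAGIC));
    }
}
```

Because the same signature also starts legacy Word and Excel files, a match
on these eight bytes alone cannot single out PowerPoint, which is one reason
magic detection needs help from other hints.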

Thanks!

Cheers,
  Chris


> 
> BR,
> 
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.