Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/06/02 03:47:15 UTC

mimetype magic vs globs

I'm running into problems with mimetype detection again.

I have a file named foo.xml.  It should be detected as
application/xml.  The thing is, within the first 64 bytes of the file
is "<title>the title</title>".  Because of this, Tika (with the 0.4
snapshot tika-mimetypes.xml) detects it as text/html, which is
wrong.  Changing the magic priority of text/html to be either higher
or lower than that of application/xml has no effect.  The magic
takes precedence over the glob pattern every time.

The easiest workaround is to edit tika-mimetypes.xml and remove the
offending rule, which does work.  But this does make me wonder: is
there a way to tell Tika to match on the glob first and then on the
magic, instead of magic then glob?
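
To make the two orderings concrete, here is a minimal Java sketch.  It
is not Tika code: the class, the method names, and the two toy lookup
tables are all hypothetical, and the "magic" check is simplified to a
substring search over the first 64 bytes.  It only illustrates how the
same file comes out differently depending on which check runs first.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Tika.
class DetectionOrderSketch {

    // Toy glob table: file name suffix -> media type.
    static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Toy magic table: text looked for in the file head -> media type.
    static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        GLOBS.put(".xml", "application/xml");
        GLOBS.put(".html", "text/html");
        MAGIC.put("<title>", "text/html");
    }

    // Returns the type for the first matching suffix, or null if none match.
    static String byGlob(String name) {
        String lower = name.toLowerCase();
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Returns the type whose pattern occurs in the first 64 bytes, or null.
    static String byMagic(byte[] data) {
        int n = Math.min(data.length, 64);
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Magic wins; the glob is consulted only when no magic matches.
    static String magicFirst(String name, byte[] data) {
        String type = byMagic(data);
        if (type == null) {
            type = byGlob(name);
        }
        return type != null ? type : "application/octet-stream";
    }

    // Glob wins; the magic is consulted only when no glob matches.
    static String globFirst(String name, byte[] data) {
        String type = byGlob(name);
        if (type == null) {
            type = byMagic(data);
        }
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] data = "<doc><title>the title</title></doc>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(magicFirst("foo.xml", data)); // text/html
        System.out.println(globFirst("foo.xml", data));  // application/xml
    }
}
```

With magic first, foo.xml comes back as text/html, which is the
behaviour described above; with glob first, the .xml suffix settles it
as application/xml and the magic is only a fallback for names that
match no glob.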

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/