Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/06/02 03:47:15 UTC
mimetype magic vs globs
I'm running into problems with mimetype detection again.
I have a file named foo.xml. It should be detected as
application/xml. The thing is, within the first 64 bytes of the file
is "<title>the title</title>". Because of this, Tika (with the 0.4
snapshot tika-mimetypes.xml) detects it as text/html, which is
wrong. Changing the magic priority of text/html to be either higher
or lower than that of application/xml doesn't do anything. The magic
takes precedence over the glob pattern every time.
The easiest thing to do is just to edit tika-mimetypes.xml and remove
the offending rule, which does work. But it makes me wonder: is
there a way to tell Tika to match on the glob first and then on the
magic, instead of magic then glob?
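
To make the ordering I'm asking about concrete, here is a minimal,
self-contained sketch of glob-first detection. It does not use Tika's
API at all: the class GlobFirstDetector, its two lookup tables, and
the method names are hypothetical and exist only for illustration.
The 64-byte window mirrors the case described above, and the rules
are a toy subset, not the contents of tika-mimetypes.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not Tika's implementation or API.
final class GlobFirstDetector {
    // Toy glob table: file-name suffix -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Toy magic table: byte pattern looked for in the first 64 bytes -> media type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".xml", "application/xml");
        GLOBS.put(".html", "text/html");
        MAGIC.put("<?xml", "application/xml");
        MAGIC.put("<title>", "text/html");
    }

    // Returns the type of the first matching suffix, or null if none matches.
    static String detectByGlob(String name) {
        if (name == null) {
            return null;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Returns the type of the first pattern found in the first 64 bytes, or null.
    static String detectByMagic(byte[] data) {
        if (data == null) {
            return null;
        }
        int len = Math.min(data.length, 64);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Glob first; magic is consulted only when no glob matches.
    static String detect(String name, byte[] data) {
        String type = detectByGlob(name);
        if (type == null) {
            type = detectByMagic(data);
        }
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] body = "<doc><title>the title</title></doc>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("foo.xml", body)); // glob wins
        System.out.println(detect("foo", body));     // falls back to magic
    }
}
```

With this ordering, foo.xml resolves by its name and the <title>
bytes are only consulted for files whose names match no glob. The
trade-off is that a misnamed file is then trusted by its extension.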
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/