Posted to dev@tika.apache.org by Tom Barber <to...@meteorite.bi> on 2014/09/07 22:28:18 UTC
MediaTypeRegistry normalize query
Hey guys
I was working on something related to MimeTypes.getRegisteredMimeType, and
within that method it calls
registry.normalize(type)
Now, when parsing HTML files these days, Tika adds the charset parameter
to the media type string.
I would have thought the normalize call was designed to remove this,
because tika-mimetypes.xml surely isn't supposed to contain entries that
match on charset?
Anyway, if you do
Tika.detect(myurl)
followed by
MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");
it returns null because it doesn't strip the charset; without the charset it's fine.
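For illustration, here is a minimal sketch of the kind of stripping I had expected to happen before the lookup. It uses only the JDK; the class MediaTypeBase and the method stripParameters are hypothetical names made up for this example and are not part of Tika. It only drops everything from the first ';' onwards, so it is a rough workaround under that assumption, not a full media type parser.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: reduces a media type string to its
// base "type/subtype" form by dropping parameters such as "; charset=UTF-8".
final class MediaTypeBase {

    private MediaTypeBase() {
        // static helper only, no instances
    }

    // Returns the text before the first ';', trimmed and lower-cased.
    // Returns null when given null.
    static String stripParameters(String mediaType) {
        if (mediaType == null) {
            return null;
        }
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With this, MediaTypeBase.stripParameters("text/html; charset=UTF-8") gives "text/html", which could then be passed to getRegisteredMimeType. I believe Tika's own MediaType class also has a getBaseType() method that serves the same purpose, but I have not checked how it interacts with the registry lookup here.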
Bug/Feature/Misunderstanding?
Regards
Tom
--
*Tom Barber* | Technical Director
meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK