Posted to dev@tika.apache.org by Tom Barber <to...@meteorite.bi> on 2014/09/07 22:28:18 UTC

MediaTypeRegistry normalize query

Hey guys

I was working with MimeTypes.getRegisteredMimeType and noticed that 
within that method it calls

registry.normalize(type)

Now, when parsing HTML files these days, Tika adds the charset parameter 
to the type string.

I would have thought the normalize call was designed to remove this, 
since tika-mimetypes.xml surely isn't supposed to contain 
charset-specific entries?

Anyway, if you do

Tika.detect(myurl)

followed by

MimeTypes.getRegisteredMimeType("text/html; charset=UTF-8");

it returns null because it doesn't strip the charset; without the charset it's fine.
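
As a stopgap, the parameters can be stripped before the lookup. The following is only a minimal sketch in plain Java: stripParameters and the MediaTypeBase class are hypothetical names, not part of Tika, and it assumes that everything after the first ';' is a parameter that can be safely dropped for registry lookups. Within Tika itself, MediaType.parse(type).getBaseType() should give an equivalent result, but that is not what this sketch uses.

```java
import java.util.Locale;

public class MediaTypeBase {

    // Hypothetical helper (not a Tika API): returns the bare "type/subtype"
    // part of a media type string, dropping parameters such as charset.
    public static String stripParameters(String type) {
        if (type == null) {
            return null;
        }
        int semicolon = type.indexOf(';');
        String base = semicolon >= 0 ? type.substring(0, semicolon) : type;
        // Media type names are case-insensitive, so lower-case the result.
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The string from the example above, with and without the charset.
        System.out.println(stripParameters("text/html; charset=UTF-8"));
        System.out.println(stripParameters("text/html"));
    }
}
```

The stripped string could then be passed to getRegisteredMimeType in place of the original.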

Bug/Feature/Misunderstanding?

Regards

Tom
-- 
*Tom Barber* | Technical Director

meteorite bi
*T:* +44 20 8133 3730
*W:* www.meteorite.bi | *Skype:* meteorite.consulting
*A:* Surrey Technology Centre, Surrey Research Park, Guildford, GU2 7YG, UK