
Posted to user@tika.apache.org by "george.young2@baesystems.com" <ge...@baesystems.com> on 2014/06/16 12:43:06 UTC

UTF-16 encoded HTML files detected as text/plain

I can successfully detect valid HTML files in other encodings, but when a valid file is encoded as UTF-16 it is identified as text/plain. I can see that in tika-mimetypes.xml the UTF-16 BOMs are used to identify files as text/plain with a priority of 20, and *.html identification is set to a priority of 40. I'm not sure why this is the case.
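To show what the BOM rule is matching on, here is a minimal sketch using only the Java standard library (it does not call Tika). The class and method names are hypothetical, chosen just for illustration. It encodes a small HTML string as UTF-16 and checks whether the first two bytes are a byte order mark, which is the part of the file that the text/plain magic in tika-mimetypes.xml appears to key on.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows the leading bytes of UTF-16 encoded HTML.
public class Utf16BomDemo {

    // Encode with Java's "UTF-16" charset, which writes a big-endian BOM first.
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // True if the data starts with FE FF (big-endian) or FF FE (little-endian).
    static boolean startsWithUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }

    public static void main(String[] args) {
        byte[] utf16 = encodeUtf16("<html><body>hi</body></html>");
        byte[] utf8 = "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8);

        // The UTF-16 bytes begin FE FF 00 3C, so "<html" is not a contiguous ASCII run.
        System.out.printf("UTF-16 starts: %02X %02X %02X %02X%n",
                utf16[0] & 0xFF, utf16[1] & 0xFF, utf16[2] & 0xFF, utf16[3] & 0xFF);
        System.out.println("UTF-16 has BOM: " + startsWithUtf16Bom(utf16));
        System.out.println("UTF-8 has BOM:  " + startsWithUtf16Bom(utf8));
    }
}
```

If this is what is happening, the same HTML content matches the BOM signature only in its UTF-16 form, which would be consistent with the behaviour I am seeing.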

I see the advice here is not to alter tika-mimetypes.xml (and indeed that would be a pain to maintain) and that custom-mimetypes.xml should be used for new file types. However, I want to override the existing text/plain definition, either by reducing its priority or by removing the UTF-16 magic signatures, so that my valid UTF-16 HTML files are correctly identified.

Is this possible, or is there a better way to correctly identify my UTF-16 HTML files, as I can with those in other encodings?

George






