Posted to user@tika.apache.org by ge...@georgey.co.uk on 2014/06/17 10:03:43 UTC

Detecting html file which is utf-16 encoded

I want to be able to detect when a file is HTML even when it is UTF-16
encoded. I can see from the default tika-mimetypes.xml that files with a BOM
are normally detected as text/plain, and that is what happens here. I have
tried creating my own versions of the html and text mime types in a
custom-mimetypes.xml. These successfully override the original ones, but
changing their priority does not force the UTF-16 files to be identified as
HTML. Even removing the BOM matches completely from the text mimetype in the
custom-mimetypes.xml does not work.

So I tried another approach: removing the BOM from the input stream before
detecting. However, the UTF-16 file is still not recognised as HTML, despite
the text having multiple matches. It seems that the detect method does not
realise what encoding is being used for the file. Is there a way to tell a
detector what encoding a file is in to aid detection?

Thanks

George

Re: Detecting html file which is utf-16 encoded

Posted by Ken Krugler <kk...@transpac.com>.
Hi George,

One thing to try - in tika-mimetypes.xml, the entry for text/html has:

    <magic priority="40">
      <match value="&lt;!DOCTYPE HTML" type="string" offset="0:64"/>
      <match value="&lt;!doctype html" type="string" offset="0:64"/>
      <match value="&lt;HEAD" type="string" offset="0:64"/>
      <match value="&lt;head" type="string" offset="0:64"/>

(and so on)

Try replicating the <match> entries, but with type="little16" or type="big16" (depending on your file's encoding).
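
To see why the plain string patterns cannot match a UTF-16 file, it may help to look at the raw bytes. The following is a minimal Java sketch using only the standard library. It does not call Tika, and the class name Utf16MagicBytes is made up for illustration. It prints the bytes of one of the markers above in ASCII and in both UTF-16 byte orders.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: shows the byte patterns a
// magic matcher would have to find for the same marker text.
final class Utf16MagicBytes {

    // Render a byte array as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String marker = "<head";
        // One byte per character.
        System.out.println(toHex(marker.getBytes(StandardCharsets.US_ASCII)));
        // prints 3c68656164
        // Two bytes per character, low byte first.
        System.out.println(toHex(marker.getBytes(StandardCharsets.UTF_16LE)));
        // prints 3c006800650061006400
        // Two bytes per character, high byte first.
        System.out.println(toHex(marker.getBytes(StandardCharsets.UTF_16BE)));
        // prints 003c0068006500610064
    }
}
```

In UTF-16 each of these characters takes two bytes, one of which is zero, so a byte-for-byte comparison against the ASCII pattern fails even once the BOM is out of the way. Whichever match type is used, the pattern has to describe the interleaved bytes.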

You might also need to remove the BOM from the input stream.
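
One possible way to drop the BOM, sketched in plain Java with no Tika calls. The class and method names (BomStripper, skipUtf16Bom) are assumptions made for this example. The result is wrapped in a BufferedInputStream because detectors generally expect a stream that supports mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of Tika.
final class BomStripper {

    // Returns a stream positioned just after a UTF-16 BOM (FF FE or FE FF)
    // if one is present, otherwise positioned at the original start.
    // Note: FF FE is also the start of a UTF-32LE BOM; this sketch ignores that.
    static InputStream skipUtf16Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes, tolerating short reads and early end of stream.
        while (n < 2) {
            int r = pin.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean littleEndianBom = n == 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE;
        boolean bigEndianBom = n == 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF;
        if (!littleEndianBom && !bigEndianBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees them again.
            pin.unread(head, 0, n);
        }
        // BufferedInputStream adds the mark/reset support that PushbackInputStream lacks.
        return new BufferedInputStream(pin);
    }

    public static void main(String[] args) throws IOException {
        // FF FE followed by '<' encoded as UTF-16LE.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x3c, 0x00};
        InputStream stripped = skipUtf16Bom(new ByteArrayInputStream(data));
        System.out.println(Integer.toHexString(stripped.read())); // prints 3c
    }
}
```

The returned stream could then be handed to the detector in place of the original one.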

Let me know if that works. Feels like a Jira issue is warranted in either case...

-- Ken






--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr