Posted to user@tika.apache.org by AJ Weber <aw...@comcast.net> on 2013/02/06 17:39:11 UTC

MHTML files

Anyone know if proper detection of MHT/MHTML files is on the roadmap for 
Tika?

I see that the format is a "close relative" of an outlook MSG file (it's 
got a mime-encapsulated format), and that's what Tika appears to think 
they are -- but they're not.

-AJ


Re: MHTML files

Posted by AJ Weber <aw...@comcast.net>.
Thanks again, M$FT. ;)

On 2/6/2013 12:07 PM, Nick Burch wrote:
> On Wed, 6 Feb 2013, AJ Weber wrote:
>> And if you try to open it with a text editor it's illegible, so it 
>> must be compressed somehow.
>
> Based on the other findings you have, it's OLE2-based.
>
>> java -jar ./tika-app-1.3.jar -m Receipt.mht
>> Author: Saved by Windows Internet Explorer 8
>> Content-Length: 43008
>> Content-Type: application/vnd.ms-outlook
>> Creation-Date: 2012-10-05T14:44:15Z
>> Last-Modified: 2012-10-05T14:44:15Z
>> Last-Save-Date: 2012-10-05T14:44:15Z
>> Message-Bcc:
>> Message-Cc:
>> Message-From: Saved by Windows Internet Explorer 8
>> Message-To:
>> creator: Saved by Windows Internet Explorer 8
>
> This does all look like it's a special kind of Outlook file.
>
> Nick

Re: MHTML files

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 6 Feb 2013, AJ Weber wrote:
> And if you try to open it with a text editor it's illegible, so it must be 
> compressed somehow.

Based on the other findings you have, it's OLE2-based.

> java -jar ./tika-app-1.3.jar -m Receipt.mht
> Author: Saved by Windows Internet Explorer 8
> Content-Length: 43008
> Content-Type: application/vnd.ms-outlook
> Creation-Date: 2012-10-05T14:44:15Z
> Last-Modified: 2012-10-05T14:44:15Z
> Last-Save-Date: 2012-10-05T14:44:15Z
> Message-Bcc:
> Message-Cc:
> Message-From: Saved by Windows Internet Explorer 8
> Message-To:
> creator: Saved by Windows Internet Explorer 8

This does all look like it's a special kind of Outlook file.

Nick
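
A quick way to check the "OLE2-based" guess without Tika is to look at the first bytes of the file. OLE2 (Compound File Binary) containers begin with the fixed signature D0 CF 11 E0 A1 B1 1A E1, whereas an rfc822-style MHTML file starts with plain-text headers. The sketch below is only an illustration: the function names are made up for this example, the "text" test is a crude printable-ASCII heuristic, and none of it is how Tika itself does detection.

```python
# Fixed 8-byte signature at the start of every OLE2 / Compound File Binary container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(head: bytes) -> str:
    """Classify leading bytes as 'ole2', 'text', or 'unknown' (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    # Assumption: rfc822-style MHTML begins with ASCII headers, so a run of
    # printable ASCII (plus tab, LF, CR) is treated as "text". Heuristic only.
    if head and all(b in (9, 10, 13) or 32 <= b < 127 for b in head):
        return "text"
    return "unknown"


def sniff_file(path: str, length: int = 512) -> str:
    """Read the first `length` bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(length))
```

Calling sniff_file("Receipt.mht") on a file like the one described would be expected to return "ole2" if the guess above is right, and "text" for the rfc822-style .mht files.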

Re: MHTML files

Posted by AJ Weber <aw...@comcast.net>.
I have to check if the sample I have can be released publicly.  I can 
tell you that the only thing I've seen open it properly (so far) is 
actually MSFT Outlook, where it does say in the From metadata "Saved by 
Windows Internet Explorer 8".

When I ask Tika 1.3 to detect the format of this file it says:
java -jar ./tika-app-1.3.jar -d Receipt.mht
application/vnd.ms-outlook

And if you try to open it with a text editor it's illegible, so it 
must be compressed somehow.
java -jar ./tika-app-1.3.jar -m Receipt.mht
Author: Saved by Windows Internet Explorer 8
Content-Length: 43008
Content-Type: application/vnd.ms-outlook
Creation-Date: 2012-10-05T14:44:15Z
Last-Modified: 2012-10-05T14:44:15Z
Last-Save-Date: 2012-10-05T14:44:15Z
Message-Bcc:
Message-Cc:
Message-From: Saved by Windows Internet Explorer 8
Message-To:
creator: Saved by Windows Internet Explorer 8
date: 2012-10-05T14:44:15Z
dc:creator: Saved by Windows Internet Explorer 8
dc:description: Receipt
dc:title: Receipt
dcterms:created: 2012-10-05T14:44:15Z
dcterms:modified: 2012-10-05T14:44:15Z
meta:author: Saved by Windows Internet Explorer 8
meta:creation-date: 2012-10-05T14:44:15Z
meta:save-date: 2012-10-05T14:44:15Z
modified: 2012-10-05T14:44:15Z
resourceName: Receipt.mht
subject: Receipt
title: Receipt

On 2/6/2013 11:49 AM, Nick Burch wrote:
> On Wed, 6 Feb 2013, AJ Weber wrote:
>> Anyone know if proper detection of MHT/MHTML files is on the roadmap 
>> for Tika?
>
> Tika can already detect MHTML files, and parse them. We have unit 
> tests for it.
>
> However, there might be more than one format using that extension...
>
>> I see that the format is a "close relative" of an outlook MSG file 
>> (it's got a mime-encapsulated format), and that's what Tika appears 
>> to think they are -- but they're not.
>
> None of the .mhtml files in the Tika test suite are anything like an 
> Outlook MSG file - they're all mbox / rfc822 style ones. (*.mht and 
> *.mhtml are both glob aliases of message/rfc822)
>
> Do you have a small sample file of your other kind of file? And do you 
> know what software generated it, and what that software calls the file 
> format?
>
> Nick

Re: MHTML files

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 6 Feb 2013, AJ Weber wrote:
> Anyone know if proper detection of MHT/MHTML files is on the roadmap for 
> Tika?

Tika can already detect MHTML files, and parse them. We have unit tests 
for it.

However, there might be more than one format using that extension...

> I see that the format is a "close relative" of an outlook MSG file (it's 
> got a mime-encapsulated format), and that's what Tika appears to think 
> they are -- but they're not.

None of the .mhtml files in the Tika test suite are anything like an 
Outlook MSG file - they're all mbox / rfc822 style ones. (*.mht and 
*.mhtml are both glob aliases of message/rfc822)

Do you have a small sample file of your other kind of file? And do you 
know what software generated it, and what that software calls the file 
format?

Nick
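
To make the "glob aliases" remark concrete, here is a minimal sketch of a name-only lookup. The table is hypothetical and holds just the two patterns mentioned above; it is not Tika's real MIME registry or code. It shows why a file name alone would suggest message/rfc822 for Receipt.mht, while the -d output earlier in the thread reported application/vnd.ms-outlook for the same file, so the name hint evidently did not decide the result there.

```python
from fnmatch import fnmatch

# Hypothetical glob table for illustration: only the patterns named in the thread.
GLOBS = {
    "*.mht": "message/rfc822",
    "*.mhtml": "message/rfc822",
}


def type_from_name(filename: str, default: str = "application/octet-stream") -> str:
    """Return the media type suggested by the file name alone."""
    name = filename.lower()  # compare case-insensitively on every platform
    for pattern, mime in GLOBS.items():
        if fnmatch(name, pattern):
            return mime
    return default
```

A name-based guess like this is only a hint; when the leading bytes of the file say otherwise, the content is the more reliable evidence.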