Posted to user@tika.apache.org by Tucker Barbour <ba...@gmail.com> on 2017/08/10 09:56:00 UTC

Outlook For Mac (OLM) Parser?

I recently encountered a case where I need to parse an Outlook For Mac email archive (OLM). I have not found an officially published specification for the file format, but after a bit of inspection it appears to be similar to the OOXML format: it is a ZIP file containing emails in an XML format and references to binary attachments. As expected, the AutoDetectParser detects the Content-Type as application/zip and the PackageParser is invoked. This "works", but ideally I could parse an OLM similarly to other email archives such as PST or MBOX, where embedded content is handled as emails rather than XML. Since the file format is similar to OOXML, it might not be too hard to write a parser. Has anyone explored writing a Parser for OLM, or already done some work in this area?

-Tucker
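
For anyone who wants to inspect such an archive before a dedicated parser exists, here is a minimal sketch based only on the description above (a ZIP file holding XML emails plus binary attachments). It uses the Python standard library rather than Tika, and it assumes nothing about the OLM schema: it only reports the root tag of each XML entry. The function names and the entry names in the sample archive are made up for illustration and are not taken from a real OLM file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def summarize_olm(source):
    """Split the entries of an OLM-style ZIP into XML documents and other files.

    Returns (xml_entries, other_entries), where xml_entries maps each
    XML entry name to the tag of its root element.
    """
    xml_entries = {}
    other_entries = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(".xml"):
                with archive.open(info) as stream:
                    # Only the root tag is read; no OLM schema is assumed.
                    root = ET.parse(stream).getroot()
                xml_entries[info.filename] = root.tag
            else:
                # Anything that is not XML is treated as a possible attachment.
                other_entries.append(info.filename)
    return xml_entries, other_entries


def build_sample_archive():
    """Build an in-memory ZIP with hypothetical entry names, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "Inbox/message_0001.xml",
            "<email><subject>Hello</subject></email>",
        )
        archive.writestr("Inbox/attachments/report.pdf", b"%PDF-1.4 placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    xml_entries, other_entries = summarize_olm(build_sample_archive())
    print(xml_entries)
    print(other_entries)
```

Running the same function on a real .olm file (pass its path instead of the in-memory buffer) would show which entries are XML and which are candidates for attachments, which is a reasonable first step toward deciding how a parser should map them to emails.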

RE: Outlook For Mac (OLM) Parser?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Please open a ticket on our JIRA and share an example file.  We'll want to update our package detector to handle this format.  As for parsing, XML is doable, and I'd be happy to try my hand at it...if we can find enough examples...  Please no protobufs, please no protobufs... :)
