Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/29 19:16:18 UTC

[jira] [Updated] (TIKA-1865) Save sender email address in Outlook MSG metadata

     [ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1865:
------------------------------
    Attachment: report.xlsx

I took a dump of .msg files from Common Crawl.  Several of the files were truncated or were not actually MSG files, and many were "contacts" rather than actual emails.  The corpus was limited in many ways, so YMMV.

I did three things:
1) Dump the most obvious fields for the sender's email address (attached).  Finding: in general, this works well with SMTP emails; for Exchange, things get dicey.
2) For those emails with a "From:" header field, try to find all properties of type String (ASCII or Unicode) that contain that email address.  I was hoping this would identify new fields beyond the obvious ones...it didn't.
3) Find all property fields in Exchange emails that contain an email address and aren't part of a recipient chunk.  I was hoping this would lead to common patterns for Exchange emails not already picked up by the known properties, but it didn't.
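
For anyone who wants to repeat step 2 on their own corpus, here is a rough sketch of the matching logic.  It is not the code I ran: the class and method names are made up for illustration, and a plain Map of property name to string value stands in for whatever your MSG reader exposes.  It also assumes the "From:" header is not folded across lines and uses a deliberately loose email regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class SenderPropertyScan {
    // Loose pattern: fine for a survey of a corpus, not for address validation.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the first address on the "From:" header line, lowercased, or null. */
    static String fromHeaderEmail(String headers) {
        for (String line : headers.split("\\r?\\n")) {
            // Case-insensitive check that the line starts with "From:".
            if (line.regionMatches(true, 0, "From:", 0, 5)) {
                Matcher m = EMAIL.matcher(line);
                return m.find() ? m.group().toLowerCase(Locale.ROOT) : null;
            }
        }
        return null;
    }

    /** Names of the string properties whose value contains the given address. */
    static List<String> propertiesContaining(Map<String, String> props, String email) {
        List<String> hits = new ArrayList<>();
        if (email == null) {
            return hits;
        }
        for (Map.Entry<String, String> e : props.entrySet()) {
            String value = e.getValue();
            // Compare case-insensitively; the address is already lowercased.
            if (value != null && value.toLowerCase(Locale.ROOT).contains(email)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Feed fromHeaderEmail() the raw transport headers, then pass the result and the message's string properties to propertiesContaining(); counting the returned property names across a corpus shows which fields most often carry the sender's address.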

In short, I think the best bet for extracting the sender's email address is the strategy that I recommended above.  I think we may also want to pull out the sender's Exchange ID (different metadata property!), because that could be useful as an identifier.

Finally, is there an easy way to tell whether an MSG file is a message, a post, an appointment, or a contact?

> Save sender email address in Outlook MSG metadata
> -------------------------------------------------
>
>                 Key: TIKA-1865
>                 URL: https://issues.apache.org/jira/browse/TIKA-1865
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Windows 7 x64, jre 1.8.0_60 x64
>            Reporter: Luis Filipe Nassif
>         Attachments: report.xlsx
>
>
> The sender's email address is lost when extracting metadata from Outlook MSG files. Currently only the sender's name is extracted. The address is important information to extract for search engines.


