Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/10/17 18:57:59 UTC
[jira] [Resolved] (TIKA-2122) Extract all email headers from Outlook .msg files into Metadata
[ https://issues.apache.org/jira/browse/TIKA-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2122.
-------------------------------
Resolution: Fixed
Went with {{Message:Raw-Header:}} as the prefix. I ran this against the 1.7k .msg files we have in our regression corpus. There are some small areas for improvement, but overall this looks good. I was able to reuse mime4j's {{DecoderUtil.decodeEncodedWords}} to handle encoded values.
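For readers following along, here is a rough sketch of the idea, not the committed Tika code. It assumes the raw headers arrive as an array of lines (one string per line, as a call like {{msg.getHeaders()}} might return) and uses a plain map as a stand-in for Tika's Metadata object. The class name {{RawHeaderSketch}} and the method {{collect}} are hypothetical. Decoding of encoded words is deliberately left out so that the sketch depends only on the JDK; in a real implementation each value would be decoded before being stored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names here are not taken from the Tika code base.
final class RawHeaderSketch {
    // Prefix chosen in the resolution comment.
    static final String PREFIX = "Message:Raw-Header:";

    /**
     * Collects raw header lines into a multi-valued map keyed by PREFIX + header name.
     * The map is a stand-in for Tika's Metadata object (an assumption for illustration).
     */
    static Map<String, List<String>> collect(String[] headerLines) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        if (headerLines == null) {
            return out;
        }
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String line : headerLines) {
            if (line == null || line.isEmpty()) {
                continue;
            }
            char first = line.charAt(0);
            if ((first == ' ' || first == '\t') && name != null) {
                // Folded continuation line: append it to the current header's value.
                value.append(' ').append(line.trim());
                continue;
            }
            // A new header starts here, so store the one collected so far.
            flush(out, name, value);
            int colon = line.indexOf(':');
            if (colon <= 0) {
                // Not a "Name: value" line; skip it.
                name = null;
                continue;
            }
            name = line.substring(0, colon).trim();
            value.append(line.substring(colon + 1).trim());
        }
        flush(out, name, value);
        return out;
    }

    // Stores the pending header (if any) and resets the value buffer.
    private static void flush(Map<String, List<String>> out, String name, StringBuilder value) {
        if (name != null) {
            out.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value.toString());
        }
        value.setLength(0);
    }
}
```

Repeated headers such as Received keep all of their values under one prefixed key, which is why the stand-in map is multi-valued.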
> Extract all email headers from Outlook .msg files into Metadata
> ---------------------------------------------------------------
>
> Key: TIKA-2122
> URL: https://issues.apache.org/jira/browse/TIKA-2122
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.13
> Reporter: Chris Knott
> Priority: Minor
> Fix For: 2.0, 1.14
>
> Attachments: msg_raw_headers.xlsx
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently most email headers are not added to the Metadata when extracting Outlook .msg files.
> http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
> The headers - {{msg.getHeaders()}} - are already being looped through as a way to estimate the date.
> All headers should be added to Metadata, using the name of the header with a prefix such as {{"raw-header:"}}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)