Posted to user@tika.apache.org by Joe Wicentowski <jo...@gmail.com> on 2012/06/25 22:14:41 UTC

Problem extracting date from Outlook 2007 .msg file

Hi all,

Hello!  This is my message to the list.  I'm building an application
that uses Tika to extract text from Outlook 2007 .msg files, among
other things.  While experimenting with some sample .msg files, I
noticed that Tika is not returning the date of most messages.
For example, Outlook indicates that the following message was sent on
"Fri 6/22/2012 8:11 AM", but no date appears in the HTML head or in
the early portion of the body of the Tika output [1].  I retrieved
this using Tika 1.1 on Windows XP using the following command:

  java -jar tika-app-1.1.jar "C:\Documents and Settings\wicentowskijc\Desktop\portal\outlook\RE  Inquiry.msg" > inquiry.html

If anyone has suggestions for ensuring that the date can be preserved
in Tika's output, I'd be grateful.

Thanks,
Joe


[1] Tika output showing no date

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="Message-Bcc" content="" />
        <meta name="subject" content="Inquiry" />
        <meta name="Content-Length" content="40960" />
        <meta name="Message-Recipient-Address" content="snip@gmail.com" />
        <meta name="Message-From" content="History Mailbox" />
        <meta name="Author" content="History Mailbox" />
        <meta name="Message-To" content="'Snip'" />
        <meta name="Message-Cc" content="" />
        <meta name="Content-Type" content="application/vnd.ms-outlook" />
        <meta name="resourceName" content="RE  Inquiry.msg" />
    </head>
    <body>
        <h1>RE: Inquiry</h1>
        <dl>
            <dt>From</dt>
            <dd>History Mailbox</dd>
            <dt>To</dt>
            <dd>'Snip'</dd>
            <dt>Recipients</dt>
            <dd>snip@gmail.com</dd>
        </dl>
        <p>Dear Snip</p>
...

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi all,

Claudius Teodorescu has kindly created a patch that fixes the bug I
reported with POI HSMF's handling of dates in Outlook 2007 .msg files.
Please see https://issues.apache.org/bugzilla/show_bug.cgi?id=53784#c3.

Please advise us on whether we submitted the patch in a format that
will allow the POI devs to review it for incorporation into POI.

Thanks in advance,
Joe


On Thu, Aug 30, 2012 at 1:41 AM, Joe Wicentowski <jo...@gmail.com> wrote:
> Hi all,
>
> I have now created reproducible tests to illustrate the problem I'm
> having with POI HSMF's handling of dates in Outlook 2007 files.   I've
> posted the tests in the bug report I filed:
>
>   https://issues.apache.org/bugzilla/show_bug.cgi?id=53784

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi all,

I have now created reproducible tests to illustrate the problem I'm
having with POI HSMF's handling of dates in Outlook 2007 files.   I've
posted the tests in the bug report I filed:

  https://issues.apache.org/bugzilla/show_bug.cgi?id=53784

Nick Burch kindly added some comments to the bug report suggesting the
path to a solution.  I'd welcome any assistance - and if you'd like to
take this on for pay, please contact me off list with an estimate.

Thanks,
Joe

p.s. If there are other forums besides this for reaching talented POI
developers who would be willing to send an estimate, please point me
there!


On Tue, Aug 21, 2012 at 8:50 PM, Joe Wicentowski <jo...@gmail.com> wrote:
> Hi Dave,
>
> I would happily accept quotes for the job; please send quotes to me off list.
>
> Thanks,
> Joe
>
> Sent from my iPad


Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi Dave,

I would happily accept quotes for the job; please send quotes to me off list.

Thanks,
Joe

Sent from my iPad

On Aug 21, 2012, at 8:12 PM, Dave Fisher <da...@comcast.net> wrote:

> Hi Joe,
> 
> Are you looking to pay this person to help or are you looking for someone with the same "itch" as you?
> 
> (Not that I am volunteering either way - it's not my area.)
> 
> Regards,
> Dave


Re: Problem extracting date from Outlook 2007 .msg file

Posted by Dave Fisher <da...@comcast.net>.
Hi Joe,

Are you looking to pay this person to help or are you looking for someone with the same "itch" as you?

(Not that I am volunteering either way - it's not my area.)

Regards,
Dave

On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:

> Hi all,
> 
> I hadn't heard from anyone about the question I posed last week --
> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
> Is there a better forum for me to post this?  Should I file a bug?
> Ideally, I'd like to find someone who can help complete the fix that
> Nick Burch began in POI's SVN trunk.
> 
> Thanks for any pointers about the best way to proceed,
> Joe


Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi all,

I hadn't heard from anyone about the question I posed last week --
regarding POI/HSMF's problems identifying dates in Outlook .msg files.
Is there a better forum for me to post this?  Should I file a bug?
Ideally, I'd like to find someone who can help complete the fix that
Nick Burch began in POI's SVN trunk.

Thanks for any pointers about the best way to proceed,
Joe

On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <jo...@gmail.com> wrote:
> Hi all,
>
> Hello!  This is my message to the list.  I'm building an application
> that relies on Tika to extract text from Outlook 2007 .msg files.
> Tika relies on POI's HSMF libraries, which is why I'm writing to this
> list about a problem: HSMF is not pulling out the date of many of my
> Outlook messages.


Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi all,

Hello!  This is my message to the list.  I'm building an application
that relies on Tika to extract text from Outlook 2007 .msg files.
Tika relies on POI's HSMF libraries, which is why I'm writing to this
list about a problem: HSMF is not pulling out the date of many of my
Outlook messages.

For example, when I look at one of my message files (.msg) in Outlook,
it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
I process the same message with Tika, no date appears in the output
[1].

In comparison, I tried using a different tool, ruby-msg
(http://code.google.com/p/ruby-msg/), to process the same message, and
ruby-msg did pull out the date [2].  This experiment shows that the
date *is* in the .msg file, and that Tika is failing to pick it up.

Nick Burch from the Tika mailing list took a close, hands-on look at
my .msg file, determined the cause, and outlined a path to the fix:

> I think I've figured out what's wrong. It looks like outlook stores
> properties with a fixed size of 0-8 bytes in a different chunk in the file,
> which we weren't processing.
>
> If you wanted to tackle it, that'd be great! You'll want to take a look at
> PropertiesChunk, and fill in the TODOs for readProperties and
> writeProperties, then add unit tests. See:
>
>  http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>
> When that's all done and working, then
> the final step is to update MAPIMessage to read some of the values as needed
> out of the properties
>
> The info I've been working with comes from this blog post:
> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>
> (That links into suitable bits of the public documentation)
>
> I suspect it's under a day's work. I've put in place the basics, just needs someone to flesh it out.
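
For readers following Nick's description, here is a minimal sketch of what
reading one fixed-size property might look like. It is based on the public
MSG format documentation he links to, not on POI's PropertiesChunk code. The
layout assumed here is a 16-byte entry (4-byte tag, 4-byte flags, 8-byte
value, all little-endian), with the property type in the low 16 bits of the
tag and dates stored as FILETIME values (100-nanosecond ticks since
1601-01-01 UTC). The class name, method names, and the sample entry are
hypothetical; the sample bytes are constructed for illustration and were not
read from Joe's file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

class FixedPropertySketch {
    // Seconds between the FILETIME epoch (1601-01-01 UTC) and the Unix epoch.
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    // Property type code for a time value (PT_SYSTIME), per the MSG docs.
    static final int PT_SYSTIME = 0x0040;

    // Converts 100-nanosecond ticks since 1601-01-01 UTC to an Instant.
    static Instant fileTimeToInstant(long fileTime) {
        long seconds = Math.floorDiv(fileTime, 10_000_000L) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(fileTime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Decodes one 16-byte entry; returns null unless it holds a time value.
    static Instant readTimeEntry(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int tag = buf.getInt();     // high 16 bits: property ID, low 16 bits: type
        buf.getInt();               // flags, ignored in this sketch
        long value = buf.getLong(); // the 8-byte fixed-size value
        int type = tag & 0xFFFF;
        return type == PT_SYSTIME ? fileTimeToInstant(value) : null;
    }

    // Hypothetical entry: ID 0x0039 (client submit time), type 0x0040,
    // value chosen to equal 2012-06-22T12:11:00Z.
    static byte[] sampleEntry() {
        return ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(0x00390040)
                .putInt(0)
                .putLong(129_848_406_600_000_000L)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(readTimeEntry(sampleEntry())); // prints 2012-06-22T12:11:00Z
    }
}
```

A real reader would also have to skip the stream header and handle the
variable-length entries, which this sketch leaves out.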

While Nick kindly tracked down the cause, unfortunately I lack the
java chops to complete the solution.

Would anyone here be kind enough to assist me with this?

I'm happy to test any attempted fixes, and I'm happy to provide more
info, like sample Outlook files (.msg files).  My hope is that this
fix will allow POI to "just work" for more users who are evaluating
it.

Thank you in advance,
Joe


[1] Tika output showing no date, retrieved via the following command:

   java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta name="Message-Bcc" content="" />
        <meta name="subject" content="Inquiry" />
        <meta name="Content-Length" content="40960" />
        <meta name="Message-Recipient-Address" content="snip@gmail.com" />
        <meta name="Message-From" content="History Mailbox" />
        <meta name="Author" content="History Mailbox" />
        <meta name="Message-To" content="'Snip'" />
        <meta name="Message-Cc" content="" />
        <meta name="Content-Type" content="application/vnd.ms-outlook" />
        <meta name="resourceName" content="RE  Inquiry.msg" />
    </head>
    <body>
        <h1>RE: Inquiry</h1>
        <dl>
            <dt>From</dt>
            <dd>History Mailbox</dd>
            <dt>To</dt>
            <dd>'Snip'</dd>
            <dt>Recipients</dt>
            <dd>snip@gmail.com</dd>
        </dl>
        <p>Dear Snip</p>
...

[2] The ruby-msg output -- notice the "Date:" line:

From: "History Mailbox" <re...@removed.com>
To: "Snip" <sn...@gmail.com>
Subject: RE: Inquiry
Date: Fri, 22 Jun 2012 12:11:00 -0000
Message-ID: <00...@PASA1MB01.pace.unc>
In-Reply-To: <CA...@mail.gmail.com>
Priority: 0
Thread-Topic: Inquiry
Content-Type: multipart/alternative;
boundary="----_=_NextPart_001_8149ed38.4fec8c61"
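
As a cross-check on the two timestamps above, the following sketch converts
the time Outlook displays ("Fri 6/22/2012 8:11 AM") to UTC and compares it
with the ruby-msg "Date:" header (12:11:00 -0000). The thread does not state
which time zone Outlook was using; US Eastern time is an assumption made
here purely for illustration, and the class and method names are
hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class SentTimeCheck {
    // Interprets a wall-clock time in the given zone and returns the UTC instant.
    static Instant toUtc(LocalDateTime local, ZoneId zone) {
        return local.atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        // Time as shown by Outlook in the message above.
        LocalDateTime shownByOutlook = LocalDateTime.of(2012, 6, 22, 8, 11);
        // Assumed zone; not stated anywhere in the thread.
        ZoneId assumedZone = ZoneId.of("America/New_York");

        Instant utc = toUtc(shownByOutlook, assumedZone);
        System.out.println(utc);
        // True when the converted time matches the ruby-msg Date header.
        System.out.println(utc.equals(Instant.parse("2012-06-22T12:11:00Z")));
    }
}
```

Under that assumption the two values describe the same moment, which is
consistent with the date being stored in the .msg file itself.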


Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi again,

Incidentally, when I run the output of the ruby-msg through Tika, Tika
does get the date:

  <meta name="date" content="2012-06-22T00:11:00Z"/>

I guess I could use ruby-msg to pre-process the files for Tika, but
that defeats the purpose of an all-in-one tool like Tika.

Joe

> I have an update regarding my report about Tika not recognizing the
> date in an Outlook .msg files [1].  I tried using a different tool,
> ruby-msg (http://code.google.com/p/ruby-msg/), to process the same
> message as in my earlier email, and ruby-msg did pull out the date [2]
>  This experiment shows that the email *is* in the .msg file, and that
> Tika is failing to pick it up.

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 28 Jun 2012, Joe Wicentowski wrote:
> Sounds promising.  Is there a command line way to run the HSMFDump
> component of the poi-scratchpad.jar on my .msg file?  If so, could you
> give me a pointer?

Something like this should work:
java -classpath poi-3.8-FINAL.jar:poi-scratchpad-3.8-FINAL.jar \
    org.apache.poi.hsmf.dev.HSMFDump <problem.msg>

Nick

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi Nick,

> That suggests that it's stored in a different bit of the file (a different
> stream) to the ones we're expecting to find it in. The file format is
> documented, so you can look up what each different bit means, but there are
> a lot of duplicate fields for historical reasons. What we lack is a guide
> saying "outlook 200x stores the sent date as MAPI_???_DATE, while 200y uses
> OUTLOOK_DATE_MAPI_???_V3"

I see.  This makes sense.

> What'd be great is if you could use org.apache.poi.hsmf.dev.HSMFDump
> (contained within the poi-scratchpad jar, dependency on the main poi jar but
> I don't think anything else) to try to track down which chunk contains the
> date. You might need to combine that with a little bit of hacking of your
> ruby script, to have it print some debug logging of what fields it's
> printing from
>
> Once we know the field, we can look up the details on how it's stored, then
> add a fallback check of that field/chunk too

Sounds promising.  Is there a command line way to run the HSMFDump
component of the poi-scratchpad.jar on my .msg file?  If so, could you
give me a pointer?  If not, I fear this may be getting beyond my
abilities, as I'm not a Java programmer.  (I'm generally
unix-literate and work with Mac OS X; in terms of programming, I use
XQuery with eXist-db, which incorporates Tika as an extension
module.)  I'm happy to provide the .msg file in question off list, if
it would help, but I'd understand if you aren't able to help to that
extent.

Thanks,
Joe

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 28 Jun 2012, Joe Wicentowski wrote:
> I have an update regarding my report about Tika not recognizing the date 
> in an Outlook .msg files [1].  I tried using a different tool, ruby-msg 
> (http://code.google.com/p/ruby-msg/), to process the same message as in 
> my earlier email, and ruby-msg did pull out the date [2] This experiment 
> shows that the email *is* in the .msg file, and that Tika is failing to 
> pick it up.

That suggests that it's stored in a different bit of the file (a different 
stream) to the ones we're expecting to find it in. The file format is 
documented, so you can look up what each different bit means, but there 
are a lot of duplicate fields for historical reasons. What we lack is a 
guide saying "Outlook 200x stores the sent date as MAPI_???_DATE, while
200y uses OUTLOOK_DATE_MAPI_???_V3".

What'd be great is if you could use org.apache.poi.hsmf.dev.HSMFDump 
(contained within the poi-scratchpad jar, dependency on the main poi jar 
but I don't think anything else) to try to track down which chunk contains 
the date. You might need to combine that with a little bit of hacking of 
your Ruby script, to have it print some debug logging of which fields it's
printing from.

Once we know the field, we can look up the details on how it's stored,
then add a fallback check of that field/chunk too.

Nick
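
For anyone wanting to try the HSMFDump step without writing POI-specific code, here is a minimal sketch of a launcher. It assumes that org.apache.poi.hsmf.dev.HSMFDump exposes a standard main(String[]) entry point that accepts .msg file paths, and that the poi and poi-scratchpad jars are on the classpath; neither point is confirmed in this thread. HsmfDumpLauncher and runDump are hypothetical names used only for illustration. The class is looked up by reflection so that the sketch compiles even when POI is not present.

```java
import java.lang.reflect.Method;

class HsmfDumpLauncher {
    // Class name as given in the thread; it ships in the poi-scratchpad jar.
    static final String DUMP_CLASS = "org.apache.poi.hsmf.dev.HSMFDump";

    // Returns true if the dump ran to completion, false if the class or its
    // main method could not be found, or if the dump itself threw.
    static boolean runDump(String msgPath) {
        try {
            Class<?> dumpClass = Class.forName(DUMP_CLASS);
            // Assumption: HSMFDump has a standard main(String[]) taking file paths.
            Method main = dumpClass.getMethod("main", String[].class);
            main.invoke(null, (Object) new String[] { msgPath });
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            System.err.println("Could not run " + DUMP_CLASS + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java HsmfDumpLauncher <file.msg>");
            return;
        }
        runDump(args[0]);
    }
}
```

If the assumption about main(String[]) holds, the dump output goes to standard output, where it can be searched for the chunk holding the sent date. The jar file names to put on the classpath depend on the POI version in use.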

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi again,

I have an update regarding my report about Tika not recognizing the
date in an Outlook .msg file [1].  I tried using a different tool,
ruby-msg (http://code.google.com/p/ruby-msg/), to process the same
message as in my earlier email, and ruby-msg did pull out the date [2].
This experiment shows that the date *is* in the .msg file, and that
Tika is failing to pick it up.

Can anyone suggest the best way to proceed to improve Tika's handling
of dates in Outlook .msg files?  I'll be happy to file a bug report,
but I'm just not sure whether this is an issue in Tika itself or in
one of Tika's dependencies.

Thanks,
Joe


[1] The Tika output, quoting from my last email:

> Author: PA History Mailbox
> Content-Length: 40960
> Content-Type: application/vnd.ms-outlook
> Message-Bcc:
> Message-Cc:
> Message-From: History Mailbox
> Message-Recipient-Address: snip@gmail.com
> Message-To: 'Snip'
> resourceName: RE  Inquiry.msg
> subject: Inquiry
> title: RE: Inquiry

[2] The ruby-msg output -- notice the "Date:" line:

From: "History Mailbox" <re...@removed.com>
To: "Snip" <sn...@gmail.com>
Subject: RE: Inquiry
Date: Fri, 22 Jun 2012 12:11:00 -0000
Message-ID: <00...@PASA1MB01.pace.unc>
In-Reply-To: <CA...@mail.gmail.com>
Priority: 0
Thread-Topic: Inquiry
Content-Type: multipart/alternative;
boundary="----_=_NextPart_001_8149ed38.4fec8c61"
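
The ruby-msg "Date:" header above is expressed in UTC (12:11), while Outlook displayed "8:11 AM".  The following is a quick sanity check that the two refer to the same instant.  It is only a sketch: the thread never states Outlook's time zone, so US Eastern time ("America/New_York") is an assumption, and SentDateCheck and toLocal are hypothetical names used for illustration.

```java
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SentDateCheck {
    // Convert a timestamp to the wall-clock time of the same instant in the given zone.
    static ZonedDateTime toLocal(OffsetDateTime timestamp, String zoneId) {
        return timestamp.atZoneSameInstant(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        // Date reported by ruby-msg: Fri, 22 Jun 2012 12:11:00 -0000
        OffsetDateTime sent =
                OffsetDateTime.of(2012, 6, 22, 12, 11, 0, 0, ZoneOffset.UTC);

        // Assumption: Outlook was displaying US Eastern time.
        ZonedDateTime local = toLocal(sent, "America/New_York");

        // Print in roughly the style Outlook uses in its message list.
        DateTimeFormatter outlookStyle =
                DateTimeFormatter.ofPattern("EEE M/d/yyyy h:mm a", Locale.US);
        System.out.println(local.format(outlookStyle));
    }
}
```

Under that assumption the converted time is 8:11 in the morning on 22 June 2012, which matches what Outlook shows and supports the conclusion that the sent date is stored in the .msg file itself.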

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Joe Wicentowski <jo...@gmail.com>.
Hi Nick,

Thanks so much for your reply.

>> While experimenting with some sample .msg files, I
>> noticed that Tika is not returning the date of most messages.
>> For example, Outlook indicates that the following message was sent on
>> "Fri 6/22/2012 8:11 AM", but no date appears in the HTML head or in
>> the early portion of the body of the Tika output [1].  I retrieved
>> this using Tika 1.1 on Windows XP using the following command:
>
> Did you try with --metadata?

I ran tika with --metadata on the same message I mentioned in my first
email, and tika didn't output the message's date this way either.
Here are the results:

Author: PA History Mailbox
Content-Length: 40960
Content-Type: application/vnd.ms-outlook
Message-Bcc:
Message-Cc:
Message-From: History Mailbox
Message-Recipient-Address: snip@gmail.com
Message-To: 'Snip'
resourceName: RE  Inquiry.msg
subject: Inquiry
title: RE: Inquiry

> Also, are you sure that the messages contain the dates? Some kinds of
> outlook files don't...

This same message does show a date in Outlook ("Fri 6/22/2012 8:11
AM").  Do you know of some way to tell whether the date that appears
in Outlook is actually inside the message (versus stored elsewhere in
some sort of Outlook database)?  In other mail clients I would think
to look at the "mail headers" mode, but I don't recall seeing such a
mode in Outlook.  Do you happen to know under what circumstances
Outlook would not include a date?

Tika does recognize dates in some of my sample messages, but these
are definitely the minority.  In fact, Tika retrieved dates for only
3 of 47 messages.  (Specifically, those 3 messages have the
following fields: date, Creation-Date, and Last-Save-Date.)

Thanks for any suggestions,
Joe

Re: Problem extracting date from Outlook 2007 .msg file

Posted by Nick Burch <ni...@alfresco.com>.
On 25/06/12 21:14, Joe Wicentowski wrote:
> Hello!  This is my message to the list.  I'm building an application
> that uses Tika to extract text from Outlook 2007 .msg files, among
> other things.  While experimenting with some sample .msg files, I
> noticed that Tika is not returning the date of most messages.
> For example, Outlook indicates that the following message was sent on
> "Fri 6/22/2012 8:11 AM", but no date appears in the HTML head or in
> the early portion of the body of the Tika output [1].  I retrieved
> this using Tika 1.1 on Windows XP using the following command:

Did you try with --metadata?

Also, are you sure that the messages contain the dates? Some kinds of 
outlook files don't...

Nick