You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by Adrian Conlon <Ad...@arup.com> on 2015/03/17 12:48:55 UTC

Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Hi List,

I submitted a bug report + patch a week or so ago, and I was wondering whether one of the committers could take a look and see whether it looks OK or not.

https://bz.apache.org/bugzilla/show_bug.cgi?id=57678

I realise that it isn't a major bug, but I'm using this as "testing the water" for making other bug fixes for HSMF. With that in mind, I'd appreciate pointers as to making accepting my changes as painless as possible for committers to take.

Thanks for looking!

Adrian
____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses

RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Adrian Conlon <Ad...@arup.com>.
Thanks for the feedback re: #57678

As you suspected (I think!), the check I used was back to front.  Looking at the problem galvanised me to get POI/HSMF building under eclipse and the debugger, so I was able to step into the tests and see clearly how mad my original code was!

I've added test cases for the year transition, but I had to manually hack the data a little to generate the test cases.  Is this acceptable?  Modern emails just don't seem to include this property.  At least the addition of the extra test data highlighted the problem you were alluding to...

I'll have a closer look at PropertiesChunk.java and see what I can do now I've got Eclipse in a state I can work with...

Thanks again for the info!

Adrian

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: 22 March 2015 17:25
To: POI Developers List
Subject: RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

On Fri, 20 Mar 2015, Adrian Conlon wrote:
> I've only got a single file that shows the problem, the suggested 
> patch makes this email parse correctly and allows existing tests to 
> run.  I tried to follow the existing style of tests, but I'm not sure 
> how to create test data that transitions the year around the change-over date.

If you have a copy of outlook to hand, I'd suggest trying to set the date on your machine to something a few decades ago, and send yourself a message. Check that, then push the date forward or backwards a few years, send again. Repeat until you have a file either side of the changeover!


> I've been reading the HSMF source + MSG file specifications.  If I've 
> understood the basic format correctly (leaving to one side the various 
> recipient and attachment tables):
>
> 1) *all* properties are accessed via the "__properties_version1.0" 
> file

In Outlook, yes, and in theory, yes. Just not in HSMF (yet!)

> 2) fixed length properties are entirely contained in this file

Yes

> 3) variable length and array properties are read from various 
> "__substg1.0_" prefixed files

Variable length and array properties are stored in the
__properties_version1.0 chunk, but their values are in the __substg1.0_ files. I think something about the property and/or what's stored in the fixed length area of the property value somehow tells you what __substg1.0_ to go to for the real value, but that's the missing step


> So I'd guess from your comment that the "__properties_version1.0" file 
> isn't being properly handled at present.  Does that sound right?  I 
> note that PropertiesChunk.java exists and attempts to parse this file.  
> Is this where you think the problem lies?

The __properties_version1.0 is mostly being decoded. The challenge is to work out how to link a variable property stored in there to a __substg1.0_ chunk with the real value

I suspect that somewhere in the file format spec it'll explain that, I just haven't managed to find it since the spec was published (but I haven't had time to look very very hard...)

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org

____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 20 Mar 2015, Adrian Conlon wrote:
> I've only got a single file that shows the problem, the suggested patch 
> makes this email parse correctly and allows existing tests to run.  I 
> tried to follow the existing style of tests, but I'm not sure how to 
> create test data that transitions the year around the change-over date.

If you have a copy of outlook to hand, I'd suggest trying to set the date 
on your machine to something a few decades ago, and send yourself a 
message. Check that, then push the date forward or backwards a few years, 
send again. Repeat until you have a file either side of the changeover!


> I've been reading the HSMF source + MSG file specifications.  If I've 
> understood the basic format correctly (leaving to one side the various 
> recipient and attachment tables):
>
> 1) *all* properties are accessed via the "__properties_version1.0" file

In Outlook, yes, and in theory, yes. Just not in HSMF (yet!)

> 2) fixed length properties are entirely contained in this file

Yes

> 3) variable length and array properties are read from various 
> "__substg1.0_" prefixed files

Variable length and array properties are stored in the 
__properties_version1.0 chunk, but their values are in the __substg1.0_ 
files. I think something about the property and/or what's stored in the 
fixed length area of the property value somehow tells you what 
__substg1.0_ to go to for the real value, but that's the missing step


> So I'd guess from your comment that the "__properties_version1.0" file 
> isn't being properly handled at present.  Does that sound right?  I note 
> that PropertiesChunk.java exists and attempts to parse this file.  Is 
> this where you think the problem lies?

The __properties_version1.0 is mostly being decoded. The challenge is to 
work out how to link a variable property stored in there to a __substg1.0_ 
chunk with the real value

I suspect that somewhere in the file format spec it'll explain that, I 
just haven't managed to find it since the spec was published (but I 
haven't had time to look very very hard...)

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Adrian Conlon <Ad...@arup.com>.
Thanks Nick,

I've only got a single file that shows the problem, the suggested patch makes this email parse correctly and allows existing tests to run.  I tried to follow the existing style of tests, but I'm not sure how to create test data that transitions the year around the change-over date.

I've been reading the HSMF source + MSG file specifications.  If I've understood the basic format correctly (leaving to one side the various recipient and attachment tables):

1) *all* properties are accessed via the "__properties_version1.0" file
2) fixed length properties are entirely contained in this file
3) variable length and array properties are read from various "__substg1.0_" prefixed files

So I'd guess from your comment that the "__properties_version1.0" file isn't being properly handled at present.  Does that sound right?  I note that PropertiesChunk.java exists and attempts to parse this file.  Is this where you think the problem lies?

All very interesting!

Adrian


-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: 18 March 2015 18:06
To: POI Developers List
Subject: RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

On Wed, 18 Mar 2015, Adrian Conlon wrote:
> WRT test files, there's a test file associated with the bugzilla 
> entry, did you mean something more targeted than this?

We really want a files with a value of 69, 70 and 71, showing a mapping to 2069, 1970 and 1971 (assuming that's the right change-over date)

> I've added some extra information from the Microsoft open 
> specifications to the item which actually seems to make the use of 
> this header in POI date metadata extraction a bit suspect.  It might 
> be better to drop the use of the message submission chunk for date extraction moving forward.

Once we have something to replace it with, that'd be good. That needs better fixed-sized property support though!

> For moving HSMF forward, I'll have a think about what you said.  My 
> knowledge of the current code base is pretty slack.  Only the message 
> submission chunk really!  Is the current code built upon HPSF or POIFS?
> If so, I might start having a play with the Microsoft specs...

POIFS. The property decoding is of MAPI properties, not OLE2 ones. Much of the decoding is there, the missing steps are around matching a property in the property chunk up to a poifs chunk with the value in, and then how to tie it all together at read time nicely

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org

____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Mar 2015, Adrian Conlon wrote:
> WRT test files, there's a test file associated with the bugzilla entry, 
> did you mean something more targeted than this?

We really want a files with a value of 69, 70 and 71, showing a mapping to 
2069, 1970 and 1971 (assuming that's the right change-over date)

> I've added some extra information from the Microsoft open specifications 
> to the item which actually seems to make the use of this header in POI 
> date metadata extraction a bit suspect.  It might be better to drop the 
> use of the message submission chunk for date extraction moving forward.

Once we have something to replace it with, that'd be good. That needs 
better fixed-sized property support though!

> For moving HSMF forward, I'll have a think about what you said.  My 
> knowledge of the current code base is pretty slack.  Only the message 
> submission chunk really!  Is the current code built upon HPSF or POIFS? 
> If so, I might start having a play with the Microsoft specs...

POIFS. The property decoding is of MAPI properties, not OLE2 ones. Much of 
the decoding is there, the missing steps are around matching a property in 
the property chunk up to a poifs chunk with the value in, and then how to 
tie it all together at read time nicely

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Adrian Conlon <Ad...@arup.com>.
Thanks Nick,

WRT test files, there's a test file associated with the bugzilla entry, did you mean something more targeted than this?

I've added some extra information from the Microsoft open specifications to the item which actually seems to make the use of this header in POI date metadata extraction a bit suspect.  It might be better to drop the use of the message submission chunk for date extraction moving forward.

For moving HSMF forward, I'll have a think about what you said.  My knowledge of the current code base is pretty slack.  Only the message submission chunk really!  Is the current code built upon HPSF or POIFS?  If so, I might start having a play with the Microsoft specs...

Adrian



-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: 18 March 2015 12:47
To: POI Developers List
Subject: Re: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

On Tue, 17 Mar 2015, Adrian Conlon wrote:
> I submitted a bug report + patch a week or so ago, and I was wondering 
> whether one of the committers could take a look and see whether it 
> looks OK or not.
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=57678

Are you sure the logic for when to switch from 19xx to 20xx is right? If you could produce a test file and/or reference in the spec, that'd help!

> I realise that it isn't a major bug, but I'm using this as "testing 
> the water" for making other bug fixes for HSMF. With that in mind, I'd 
> appreciate pointers as to making accepting my changes as painless as 
> possible for committers to take.

HSMF started life before Microsoft released the file format specs, and is based around what we could figure out easily from hex dumps. It turns out that we got some key parts back-t-front. As such, pretty much only "variable length" properties are supported. While we do have some support for fixed length properties now (which actually cover most of them), we don't have a link between the properties in the propery table and their variable length chunks with their values in.

What it really needs is someone to spend some time with the spec, work out exactly how a variable length property in the properties chunk maps to a value chunk, and code up some logic to do that. With that in place, we can deprecate much of the current code driven by the value chunks, and replace it with a "proper" way of going via the properties list. That will also mean we can expose and use a lot more properties than we currently do, and possibly also avoid some hacky things like parsing string message headers to try to find dates

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org

____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 17 Mar 2015, Adrian Conlon wrote:
> I submitted a bug report + patch a week or so ago, and I was wondering 
> whether one of the committers could take a look and see whether it looks 
> OK or not.
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=57678

Are you sure the logic for when to switch from 19xx to 20xx is right? If 
you could produce a test file and/or reference in the spec, that'd help!

> I realise that it isn't a major bug, but I'm using this as "testing the 
> water" for making other bug fixes for HSMF. With that in mind, I'd 
> appreciate pointers as to making accepting my changes as painless as 
> possible for committers to take.

HSMF started life before Microsoft released the file format specs, and is 
based around what we could figure out easily from hex dumps. It turns out 
that we got some key parts back-t-front. As such, pretty much only 
"variable length" properties are supported. While we do have some support 
for fixed length properties now (which actually cover most of them), we 
don't have a link between the properties in the propery table and their 
variable length chunks with their values in.

What it really needs is someone to spend some time with the spec, work out 
exactly how a variable length property in the properties chunk maps to a 
value chunk, and code up some logic to do that. With that in place, we can 
deprecate much of the current code driven by the value chunks, and replace 
it with a "proper" way of going via the properties list. That will also 
mean we can expose and use a lot more properties than we currently do, and 
possibly also avoid some hacky things like parsing string message headers 
to try to find dates

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org