You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@poi.apache.org by Adrian Conlon <Ad...@arup.com> on 2015/03/17 12:48:55 UTC
Bugzilla item #57678 (Incorrect year parsing in message submission
chunk)
Hi List,
I submitted a bug report + patch a week or so ago, and I was wondering whether one of the committers could take a look and see whether it looks OK or not.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57678
I realise that it isn't a major bug, but I'm using this as "testing the water" for making other bug fixes for HSMF. With that in mind, I'd appreciate pointers as to making accepting my changes as painless as possible for committers to take.
Thanks for looking!
Adrian
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
RE: Bugzilla item #57678 (Incorrect year parsing in message
submission chunk)
Posted by Adrian Conlon <Ad...@arup.com>.
Thanks for the feedback re: #57678
As you suspected (I think!), the check I used was back to front. Looking at the problem galvanised me to get POI/HSMF building under eclipse and the debugger, so I was able to step into the tests and see clearly how mad my original code was!
I've added test cases for the year transition, but I had to manually hack the data a little to generate the test cases. Is this acceptable? Modern emails just don't seem to include this property. At least the addition of the extra test data highlighted the problem you were alluding to...
I'll have a closer look at PropertiesChunk.java and see what I can do now I've got Eclipse in a state I can work with...
Thanks again for the info!
Adrian
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: 22 March 2015 17:25
To: POI Developers List
Subject: RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)
On Fri, 20 Mar 2015, Adrian Conlon wrote:
> I've only got a single file that shows the problem, the suggested
> patch makes this email parse correctly and allows existing tests to
> run. I tried to follow the existing style of tests, but I'm not sure
> how to create test data that transitions the year around the change-over date.
If you have a copy of outlook to hand, I'd suggest trying to set the date on your machine to something a few decades ago, and send yourself a message. Check that, then push the date forward or backwards a few years, send again. Repeat until you have a file either side of the changeover!
> I've been reading the HSMF source + MSG file specifications. If I've
> understood the basic format correctly (leaving to one side the various
> recipient and attachment tables):
>
> 1) *all* properties are accessed via the "__properties_version1.0"
> file
In Outlook, yes, and in theory, yes. Just not in HSMF (yet!)
> 2) fixed length properties are entirely contained in this file
Yes
> 3) variable length and array properties are read from various
> "__substg1.0_" prefixed files
Variable length and array properties are stored in the
__properties_version1.0 chunk, but their values are in the __substg1.0_ files. I think something about the property and/or what's stored in the fixed length area of the property value somehow tells you what __substg1.0_ to go to for the real value, but that's the missing step
> So I'd guess from your comment that the "__properties_version1.0" file
> isn't being properly handled at present. Does that sound right? I
> note that PropertiesChunk.java exists and attempts to parse this file.
> Is this where you think the problem lies?
The __properties_version1.0 is mostly being decoded. The challenge is to work out how to link a variable property stored in there to a __substg1.0_ chunk with the real value
I suspect that somewhere in the file format spec it'll explain that, I just haven't managed to find it since the spec was published (but I haven't had time to look very very hard...)
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
RE: Bugzilla item #57678 (Incorrect year parsing in message submission
chunk)
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 20 Mar 2015, Adrian Conlon wrote:
> I've only got a single file that shows the problem, the suggested patch
> makes this email parse correctly and allows existing tests to run. I
> tried to follow the existing style of tests, but I'm not sure how to
> create test data that transitions the year around the change-over date.
If you have a copy of outlook to hand, I'd suggest trying to set the date
on your machine to something a few decades ago, and send yourself a
message. Check that, then push the date forward or backwards a few years,
send again. Repeat until you have a file either side of the changeover!
> I've been reading the HSMF source + MSG file specifications. If I've
> understood the basic format correctly (leaving to one side the various
> recipient and attachment tables):
>
> 1) *all* properties are accessed via the "__properties_version1.0" file
In Outlook, yes, and in theory, yes. Just not in HSMF (yet!)
> 2) fixed length properties are entirely contained in this file
Yes
> 3) variable length and array properties are read from various
> "__substg1.0_" prefixed files
Variable length and array properties are stored in the
__properties_version1.0 chunk, but their values are in the __substg1.0_
files. I think something about the property and/or what's stored in the
fixed length area of the property value somehow tells you what
__substg1.0_ to go to for the real value, but that's the missing step
> So I'd guess from your comment that the "__properties_version1.0" file
> isn't being properly handled at present. Does that sound right? I note
> that PropertiesChunk.java exists and attempts to parse this file. Is
> this where you think the problem lies?
The __properties_version1.0 is mostly being decoded. The challenge is to
work out how to link a variable property stored in there to a __substg1.0_
chunk with the real value
I suspect that somewhere in the file format spec it'll explain that, I
just haven't managed to find it since the spec was published (but I
haven't had time to look very very hard...)
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
RE: Bugzilla item #57678 (Incorrect year parsing in message
submission chunk)
Posted by Adrian Conlon <Ad...@arup.com>.
Thanks Nick,
I've only got a single file that shows the problem, the suggested patch makes this email parse correctly and allows existing tests to run. I tried to follow the existing style of tests, but I'm not sure how to create test data that transitions the year around the change-over date.
I've been reading the HSMF source + MSG file specifications. If I've understood the basic format correctly (leaving to one side the various recipient and attachment tables):
1) *all* properties are accessed via the "__properties_version1.0" file
2) fixed length properties are entirely contained in this file
3) variable length and array properties are read from various "__substg1.0_" prefixed files
So I'd guess from your comment that the "__properties_version1.0" file isn't being properly handled at present. Does that sound right? I note that PropertiesChunk.java exists and attempts to parse this file. Is this where you think the problem lies?
All very interesting!
Adrian
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: 18 March 2015 18:06
To: POI Developers List
Subject: RE: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)
On Wed, 18 Mar 2015, Adrian Conlon wrote:
> WRT test files, there's a test file associated with the bugzilla
> entry, did you mean something more targeted than this?
We really want a files with a value of 69, 70 and 71, showing a mapping to 2069, 1970 and 1971 (assuming that's the right change-over date)
> I've added some extra information from the Microsoft open
> specifications to the item which actually seems to make the use of
> this header in POI date metadata extraction a bit suspect. It might
> be better to drop the use of the message submission chunk for date extraction moving forward.
Once we have something to replace it with, that'd be good. That needs better fixed-sized property support though!
> For moving HSMF forward, I'll have a think about what you said. My
> knowledge of the current code base is pretty slack. Only the message
> submission chunk really! Is the current code built upon HPSF or POIFS?
> If so, I might start having a play with the Microsoft specs...
POIFS. The property decoding is of MAPI properties, not OLE2 ones. Much of the decoding is there, the missing steps are around matching a property in the property chunk up to a poifs chunk with the value in, and then how to tie it all together at read time nicely
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
RE: Bugzilla item #57678 (Incorrect year parsing in message submission
chunk)
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 18 Mar 2015, Adrian Conlon wrote:
> WRT test files, there's a test file associated with the bugzilla entry,
> did you mean something more targeted than this?
We really want a files with a value of 69, 70 and 71, showing a mapping to
2069, 1970 and 1971 (assuming that's the right change-over date)
> I've added some extra information from the Microsoft open specifications
> to the item which actually seems to make the use of this header in POI
> date metadata extraction a bit suspect. It might be better to drop the
> use of the message submission chunk for date extraction moving forward.
Once we have something to replace it with, that'd be good. That needs
better fixed-sized property support though!
> For moving HSMF forward, I'll have a think about what you said. My
> knowledge of the current code base is pretty slack. Only the message
> submission chunk really! Is the current code built upon HPSF or POIFS?
> If so, I might start having a play with the Microsoft specs...
POIFS. The property decoding is of MAPI properties, not OLE2 ones. Much of
the decoding is there, the missing steps are around matching a property in
the property chunk up to a poifs chunk with the value in, and then how to
tie it all together at read time nicely
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
RE: Bugzilla item #57678 (Incorrect year parsing in message
submission chunk)
Posted by Adrian Conlon <Ad...@arup.com>.
Thanks Nick,
WRT test files, there's a test file associated with the bugzilla entry, did you mean something more targeted than this?
I've added some extra information from the Microsoft open specifications to the item which actually seems to make the use of this header in POI date metadata extraction a bit suspect. It might be better to drop the use of the message submission chunk for date extraction moving forward.
For moving HSMF forward, I'll have a think about what you said. My knowledge of the current code base is pretty slack. Only the message submission chunk really! Is the current code built upon HPSF or POIFS? If so, I might start having a play with the Microsoft specs...
Adrian
-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org]
Sent: 18 March 2015 12:47
To: POI Developers List
Subject: Re: Bugzilla item #57678 (Incorrect year parsing in message submission chunk)
On Tue, 17 Mar 2015, Adrian Conlon wrote:
> I submitted a bug report + patch a week or so ago, and I was wondering
> whether one of the committers could take a look and see whether it
> looks OK or not.
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=57678
Are you sure the logic for when to switch from 19xx to 20xx is right? If you could produce a test file and/or reference in the spec, that'd help!
> I realise that it isn't a major bug, but I'm using this as "testing
> the water" for making other bug fixes for HSMF. With that in mind, I'd
> appreciate pointers as to making accepting my changes as painless as
> possible for committers to take.
HSMF started life before Microsoft released the file format specs, and is based around what we could figure out easily from hex dumps. It turns out that we got some key parts back-t-front. As such, pretty much only "variable length" properties are supported. While we do have some support for fixed length properties now (which actually cover most of them), we don't have a link between the properties in the propery table and their variable length chunks with their values in.
What it really needs is someone to spend some time with the spec, work out exactly how a variable length property in the properties chunk maps to a value chunk, and code up some logic to do that. With that in place, we can deprecate much of the current code driven by the value chunks, and replace it with a "proper" way of going via the properties list. That will also mean we can expose and use a lot more properties than we currently do, and possibly also avoid some hacky things like parsing string message headers to try to find dates
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org For additional commands, e-mail: dev-help@poi.apache.org
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
Re: Bugzilla item #57678 (Incorrect year parsing in message submission
chunk)
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 17 Mar 2015, Adrian Conlon wrote:
> I submitted a bug report + patch a week or so ago, and I was wondering
> whether one of the committers could take a look and see whether it looks
> OK or not.
>
> https://bz.apache.org/bugzilla/show_bug.cgi?id=57678
Are you sure the logic for when to switch from 19xx to 20xx is right? If
you could produce a test file and/or reference in the spec, that'd help!
> I realise that it isn't a major bug, but I'm using this as "testing the
> water" for making other bug fixes for HSMF. With that in mind, I'd
> appreciate pointers as to making accepting my changes as painless as
> possible for committers to take.
HSMF started life before Microsoft released the file format specs, and is
based around what we could figure out easily from hex dumps. It turns out
that we got some key parts back-t-front. As such, pretty much only
"variable length" properties are supported. While we do have some support
for fixed length properties now (which actually cover most of them), we
don't have a link between the properties in the propery table and their
variable length chunks with their values in.
What it really needs is someone to spend some time with the spec, work out
exactly how a variable length property in the properties chunk maps to a
value chunk, and code up some logic to do that. With that in place, we can
deprecate much of the current code driven by the value chunks, and replace
it with a "proper" way of going via the properties list. That will also
mean we can expose and use a lot more properties than we currently do, and
possibly also avoid some hacky things like parsing string message headers
to try to find dates
Nick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org