Posted to dev@poi.apache.org by Mike Hugo <mi...@piragua.com> on 2013/07/26 18:11:17 UTC

MS OneNote

Hello,
I'm looking into basic support (text extraction) for MS OneNote.  I found
this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
some sample files attached.  Does anyone have any pointers as to where I
should get started?
I have downloaded the POI source code, have it compiling with the tests
running, and have read the MS OneNote specification.  Are there any pointers on
where to begin in the POI code base (e.g. "parsing OneNote files might look
something like x.y.z.Class in the source")?

Thanks in advance!

Mike

Re: MS OneNote

Posted by Mike Hugo <mi...@piragua.com>.
Thanks Nick!

On Jul 26, 2013, at 11:46 AM, Nick Burch <ap...@gagravarr.org> wrote:

> On Fri, 26 Jul 2013, Mike Hugo wrote:
>> I'm looking into basic support (text extraction) for MS OneNote.  I found
>> this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
>> some sample files attached.  Does anyone have any pointers as to where I
>> should get started?
>
> Use POIFSLister to work out whether they contain a single POIFS/OLE2 stream or multiple. If there are lots, assume it's like Outlook (HSMF) and use POIFSDump to look at the parts. If there's just one, use POIFSViewer and the docs to try to work out whether it's streams of records (e.g. HSSF), nested records (HSLF, DDF), or streams (HWPF).
>
> Once you know that, try writing something that does basic processing of the file structure. Then add some .dev. tools to print the structure (look at Visio, Outlook etc. for an idea of how we've done that). Use your own dev tool to play with the structure more. Finally, flesh out the implementation to cover all the key bits, and write lots of unit tests!
>
> Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: MS OneNote

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 26 Jul 2013, Mike Hugo wrote:
> I'm looking into basic support (text extraction) for MS OneNote.  I found
> this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
> some sample files attached.  Does anyone have any pointers as to where I
> should get started?

Use POIFSLister to work out whether they contain a single POIFS/OLE2 stream or
multiple. If there are lots, assume it's like Outlook (HSMF) and use POIFSDump
to look at the parts. If there's just one, use POIFSViewer and the docs to try
to work out whether it's streams of records (e.g. HSSF), nested records (HSLF,
DDF), or streams (HWPF).
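
As a rough pre-check before pointing POIFSLister at a sample, it can help to
confirm that the file starts with the OLE2 compound document signature at all.
The sketch below uses only the JDK. The class name Ole2Sniffer and its method
are hypothetical and not part of POI, and whether the OneNote samples pass this
check is something to find out rather than assume.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a POI class: checks for the OLE2 header signature.
class Ole2Sniffer {
    // The 8 bytes that begin every OLE2 compound document.
    private static final int[] OLE2_MAGIC =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Returns true if the buffer begins with the OLE2 signature.
    static boolean looksLikeOle2(byte[] header) {
        if (header == null || header.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if ((header[i] & 0xFF) != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: Ole2Sniffer <file>");
            return;
        }
        byte[] header = new byte[OLE2_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // read() may return fewer bytes than asked for, so loop until full or EOF.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        boolean ole2 = total == header.length && looksLikeOle2(header);
        System.out.println(args[0] + (ole2 ? ": OLE2 signature found" : ": no OLE2 signature"));
    }
}
```

If the signature is missing, the POIFS tools will not be able to open the file,
and the format documentation becomes the main guide to the container layout.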

Once you know that, try writing something that does basic processing of the
file structure. Then add some .dev. tools to print the structure (look at
Visio, Outlook etc. for an idea of how we've done that). Use your own dev
tool to play with the structure more. Finally, flesh out the
implementation to cover all the key bits, and write lots of unit tests!
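
To give a feel for what a small structure-printing dev tool might look like,
here is a minimal sketch for the "stream of records" case. It assumes a
BIFF-style layout (2-byte little-endian type, 2-byte little-endian length, then
the payload), which is how HSSF records are framed. That layout is an
assumption for illustration only and says nothing about how OneNote files are
actually laid out. RecordDumper is a hypothetical name, not an existing POI
class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical dev-style dumper for a flat stream of type/length/payload records.
class RecordDumper {
    // Returns one line per record header found in the stream.
    static List<String> describe(byte[] stream) {
        List<String> lines = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        // Fewer than 4 bytes left cannot hold a header, so trailing bytes are ignored.
        while (buf.remaining() >= 4) {
            int offset = buf.position();
            int type = buf.getShort() & 0xFFFF;   // assumed: unsigned 16-bit record type
            int length = buf.getShort() & 0xFFFF; // assumed: unsigned 16-bit payload length
            String line = String.format("offset=%d type=0x%04X length=%d", offset, type, length);
            if (length > buf.remaining()) {
                // Payload runs past the end of the data: report it and stop.
                lines.add(line + " (truncated)");
                break;
            }
            buf.position(buf.position() + length); // skip the payload
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two made-up records: type 0x0809 with 2 payload bytes, then type 0x000A with none.
        byte[] sample = {0x09, 0x08, 0x02, 0x00, 0x00, 0x06, 0x0A, 0x00, 0x00, 0x00};
        for (String line : describe(sample)) {
            System.out.println(line);
        }
    }
}
```

A dump like this makes it quick to spot whether a guessed framing is plausible:
if the lengths line up and the walk ends cleanly at the end of the stream, the
guess is probably close.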

Nick
