Posted to dev@poi.apache.org by Mike Hugo <mi...@piragua.com> on 2013/07/26 18:11:17 UTC
MS OneNote
Hello,
I'm looking into basic support (text extraction) for MS OneNote. I found
this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
some sample files attached. Does anyone have any pointers as to where I
should get started?
I have downloaded the POI source code, have it compiling with the tests
running, and have read the MS OneNote specification. Any pointers on where
to begin in the POI code base (e.g. "parsing OneNote files might look
something like x.y.z.Class in the source")?
Thanks in advance!
Mike
Re: MS OneNote
Posted by Mike Hugo <mi...@piragua.com>.
Thanks Nick!
On Jul 26, 2013, at 11:46 AM, Nick Burch <ap...@gagravarr.org> wrote:
> On Fri, 26 Jul 2013, Mike Hugo wrote:
>> I'm looking into basic support (text extraction) for MS OneNote. I found
>> this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
>> some sample files attached. Does anyone have any pointers as to where I
>> should get started?
>
> Use POIFSLister to work out whether they contain a single POIFS/OLE2 stream or several. If there are lots, assume it's like Outlook (HSMF) and use POIFSDump to look at the parts. If there's just one, use POIFSViewer and the docs to work out whether it's streams of records (e.g. HSSF), nested records (HSLF, DDF), or streams (HWPF).
>
> Once you know that, write something that does basic processing of the file structure. Then add some .dev. tools to print the structure (look at Visio, Outlook etc. for an idea of how we've done that). Use your own dev tool to explore the structure further. Finally, flesh out the implementation to cover all the key bits, and write lots of unit tests!
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>
Re: MS OneNote
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 26 Jul 2013, Mike Hugo wrote:
> I'm looking into basic support (text extraction) for MS OneNote. I found
> this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
> some sample files attached. Does anyone have any pointers as to where I
> should get started?
Use POIFSLister to work out whether they contain a single POIFS/OLE2 stream
or several. If there are lots, assume it's like Outlook (HSMF) and use
POIFSDump to look at the parts. If there's just one, use POIFSViewer and the
docs to work out whether it's streams of records (e.g. HSSF), nested records
(HSLF, DDF), or streams (HWPF).
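
Before running the POIFS tools it can help to confirm that a sample file is an OLE2 container at all. The following is only a minimal sketch of that check, using nothing but the JDK: it compares the first eight bytes of a file against the OLE2 compound document signature. The class name Ole2Sniffer and its methods are hypothetical and are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a POI class.
final class Ole2Sniffer {
    // Signature found at offset 0 of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean isOle2(byte[] header) {
        return header.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads just the first eight bytes of the file and checks them.
    static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
            return read == header.length && isOle2(header);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            String verdict = isOle2(Paths.get(arg)) ? "OLE2 container" : "not OLE2";
            System.out.println(arg + ": " + verdict);
        }
    }
}
```

If this reports "not OLE2" for a sample, the POIFS tools will not be able to open it, and the file layout has to be worked out from the specification instead.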
Once you know that, write something that does basic processing of the
file structure. Then add some .dev. tools to print the structure (look at
Visio, Outlook etc. for an idea of how we've done that). Use your own dev
tool to explore the structure further. Finally, flesh out the
implementation to cover all the key bits, and write lots of unit tests!
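
Until a format-aware .dev. tool exists, a plain offset/hex/ASCII dump is one simple way to compare the bytes of a sample file against the specification. This is a rough sketch under that assumption, using only the JDK. The class name RawDump and its method are hypothetical and do not mirror any existing POI dev tool.

```java
// Hypothetical dev helper for illustration only; not a POI class.
final class RawDump {
    // Formats data as lines of "offset  hex bytes  printable ASCII".
    static String dump(byte[] data, int bytesPerLine) {
        StringBuilder out = new StringBuilder();
        for (int offset = 0; offset < data.length; offset += bytesPerLine) {
            int end = Math.min(offset + bytesPerLine, data.length);
            out.append(String.format("%08X ", offset));
            StringBuilder ascii = new StringBuilder();
            for (int i = offset; i < offset + bytesPerLine; i++) {
                if (i < end) {
                    int b = data[i] & 0xFF; // treat the byte as unsigned
                    out.append(String.format(" %02X", b));
                    ascii.append(b >= 0x20 && b < 0x7F ? (char) b : '.');
                } else {
                    out.append("   "); // pad a short final line
                }
            }
            out.append("  ").append(ascii).append('\n');
        }
        return out.toString();
    }
}
```

Calling RawDump.dump(bytes, 16) on the first few hundred bytes of a sample gives output that can be read side by side with the header layout in the specification.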
Nick