Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2019/11/24 17:21:07 UTC

Call for Microsoft OneNote experts for help on OneNote parsing in Tika

Dear Tika Devs:

I am working on a OneNote parser for Tika, and I'm at the point where I need
help with some of the inner workings of OneNote documents.

Here is the project so far:

https://github.com/nddipiazza/onenote-parser-java

Basically I just need some help understanding some of the finer details of
the OneNote format and how to extract info from it.

https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document
https://stackoverflow.com/questions/59020176/onenote-not-able-to-find-all-the-property-ids-in-the-microsoft-documentation

If anyone has a moment, could you please drop in, take a peek at the source,
and see if you can answer my questions?

-Nicholas

Re: Call for Microsoft OneNote experts for help on OneNote parsing in Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 24 Nov 2019, Nicholas DiPiazza wrote:
> Basically I just need some help understanding some of the finer details of
> the OneNote format and how to extract info from it.
>
> https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document
> https://stackoverflow.com/questions/59020176/onenote-not-able-to-find-all-the-property-ids-in-the-microsoft-documentation

If you're having issues with implementing bits described in the specs, you 
might find it best to ping the Apache POI dev list for help. Most of the 
Apache people who've worked with the Microsoft binary file formats are 
there!

If you're finding gaps in the published Microsoft specifications, the best 
option is to contact the Microsoft docs team. They're really nice people! 
And they want to help! They can't always help, because some bits of the 
file formats are, for complicated reasons, not covered by the open 
specifications, but often they can.

For the case where properties are found "in the wild" but missing from the 
documentation, it's probably worth just dropping the Microsoft docs team 
an email:
<https://docs.microsoft.com/en-us/openspecs/dev_center/ms-devcentlp/a7729059-1a2f-4698-a995-c0c011df2580>
Link to the page of the docs you're following, give them the list of IDs 
you've found, and ask if it is expected that those IDs are missing. Based 
on past experience, they'll take a few days to find someone on the 
relevant team, and either come back with a "whoops, our bad, will be fixed 
in the next 1-2 releases of the docs" or "sorry, deliberately excluded 
for now".

Nick