Posted to dev@tika.apache.org by "Gordon Vidal (Jira)" <ji...@apache.org> on 2022/08/03 08:55:00 UTC

[jira] [Created] (TIKA-3828) OneNote Parser - Parsed Files are Missing Parts of the Content

Gordon Vidal created TIKA-3828:
----------------------------------

             Summary: OneNote Parser - Parsed Files are Missing Parts of the Content
                 Key: TIKA-3828
                 URL: https://issues.apache.org/jira/browse/TIKA-3828
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.28.4, 2.4.1
            Reporter: Gordon Vidal
         Attachments: TestSection1 (1).one, TikaParserErrorScreenshot.png

OneNote files that I receive from SharePoint Online are not parsed correctly: parts of their content are missing from the parsed output. See the attached screenshot and OneNote section file.

I have been able to reproduce this issue consistently with the following steps:
 * Create a OneNote document with multiple sections.
 * Edit the OneNote document using the option "Open in Desktop App" and make changes in different sections, saving between edits. I have used both OneNote 2016 (Version 1808) and OneNote 2021 (Version 2108).
 * Download a section of the OneNote document using the SharePoint Online REST API.
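
To make the missing content checkable without relying on a screenshot, a comparison along these lines could be used. This is only a sketch: MissingContentCheck and findMissing are hypothetical names and not part of Tika, the phrases are placeholders, and the extracted text is assumed to have been obtained separately from the parser (for example via the Tika facade's parseToString).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reports which expected phrases
// are absent from the text that a parser extracted from a document.
final class MissingContentCheck {

    // Collapse runs of whitespace so that line breaks inserted by the
    // parser do not cause false "missing" reports.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Returns the expected phrases that do not occur in the extracted text.
    static List<String> findMissing(String extractedText, List<String> expectedPhrases) {
        String haystack = normalize(extractedText);
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder text; in practice this would be the parser output,
        // e.g. from new org.apache.tika.Tika().parseToString(file).
        String extracted = "Page one\nfirst edit";
        List<String> expected = List.of("first edit", "second edit");
        System.out.println(findMissing(extracted, expected));
    }
}
```

Running the same check against a section saved once and a section saved after several desktop edits would show whether only the later edits go missing.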

I will be investigating this issue myself as well. The Tika codebase is quite new to me, so any information about the status of this bug, its potential cause, and any plans to fix it would be very welcome.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)