Posted to dev@tika.apache.org by "Funbit (JIRA)" <ji...@apache.org> on 2019/06/20 02:01:00 UTC

[jira] [Created] (TIKA-2897) Invalid XHTML output for some OpenOffice files (created in LibreOffice Impress)

Funbit created TIKA-2897:
----------------------------

             Summary: Invalid XHTML output for some OpenOffice files (created in LibreOffice Impress)
                 Key: TIKA-2897
                 URL: https://issues.apache.org/jira/browse/TIKA-2897
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.21
         Environment: Command line to reproduce:

java -jar tika-app.jar --xml Impress.odp
            Reporter: Funbit
         Attachments: Impress.odp

The XHTML output produced by Tika 1.21 is invalid (not well-formed) for some LibreOffice documents. A sample document (created in LibreOffice 6.1.5) is attached.

Here is the sample output. The <p class="notes"> element is never closed and is followed by an unmatched </notes> end tag, so any XML parser will fail to parse it:

<p class="notes"><div/>
</notes><div><p>SECOND PAGE</p>
</div>
<div><ul> <li><p>Text on the second page</p>
</li>
</ul>
</div>
<p class="notes"><div/>
</notes></body></html>
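
A minimal sketch of how the problem can be confirmed independently of Tika, assuming only the JDK's built-in SAX parser. The class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration and are not part of Tika. The first input is the mismatched fragment from the output above; the second is the same fragment with the end tag one would expect.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: checks XML well-formedness only.
public class WellFormedCheck {

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores all content events and rethrows fatal errors.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here as a SAXParseException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Fragment as emitted in the report: <p> is closed by </notes>.
        String broken = "<p class=\"notes\"><div/></notes>";
        // Same fragment with a matching end tag (an assumed correct form).
        String fixed = "<p class=\"notes\"><div/></p>";
        System.out.println(isWellFormed(broken)); // prints false
        System.out.println(isWellFormed(fixed));  // prints true
    }
}
```

To check the full output rather than a fragment, the result of the command line above could be redirected to a file and its contents passed to the same method.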

 

Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)