Posted to dev@tika.apache.org by "Funbit (JIRA)" <ji...@apache.org> on 2019/06/20 02:01:00 UTC
[jira] [Created] (TIKA-2897) Invalid XHTML output for some
OpenOffice files (created in LibreOffice Impress)
Funbit created TIKA-2897:
----------------------------
Summary: Invalid XHTML output for some OpenOffice files (created in LibreOffice Impress)
Key: TIKA-2897
URL: https://issues.apache.org/jira/browse/TIKA-2897
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.21
Environment: Command line to reproduce:
java -jar tika-app.jar --xml Impress.odp
Reporter: Funbit
Attachments: Impress.odp
The XHTML output produced by Tika 1.21 is invalid for some LibreOffice documents. A sample document (created in LibreOffice Impress 6.1.5) is attached.
Here is the sample output (the <p class="notes"> tag is never closed and is followed by a stray </notes> end tag, so any XHTML parser will fail to parse it):
<p class="notes"><div/>
</notes><div><p>SECOND PAGE</p>
</div>
<div><ul> <li><p>Text on the second page</p>
</li>
</ul>
</div>
<p class="notes"><div/>
</notes></body></html>
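The failure can be confirmed without Tika by feeding the output to any standard XML parser. Below is a minimal sketch that uses only the JDK's built-in parser. The class name WellFormedCheck and the helper isWellFormed are hypothetical names chosen for illustration, and the two fragments are shortened versions of the sample output above, wrapped in a <body> element. It checks XML well-formedness only, not validity against the XHTML schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own diagnostics to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened from the reported output: <p> is closed by </notes>.
        String broken = "<body><p class=\"notes\"><div/>\n</notes></body>";
        // Same fragment with the end tag matching the start tag.
        String repaired = "<body><p class=\"notes\"><div/>\n</p></body>";

        System.out.println(isWellFormed(broken));   // false
        System.out.println(isWellFormed(repaired)); // true
    }
}
```

To check a complete document, the output of the command line above can be redirected to a file and its contents passed to the same helper.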
Thanks!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)