Posted to dev@tika.apache.org by "Konstantin Gribov (JIRA)" <ji...@apache.org> on 2015/04/15 15:17:59 UTC

[jira] [Closed] (TIKA-1597) RTF with embedded image parsing produces div before html

     [ https://issues.apache.org/jira/browse/TIKA-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-1597.
-----------------------------------

> RTF with embedded image parsing produces div before html
> --------------------------------------------------------
>
>                 Key: TIKA-1597
>                 URL: https://issues.apache.org/jira/browse/TIKA-1597
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>         Environment: linux, oracle jdk 7u75
>            Reporter: Konstantin Gribov
>             Fix For: 1.8
>
>         Attachments: 2.rtf, 3.rtf
>
>
> On tika-1.8-rc1, {{java -jar tika-app/target/tika-app-1.8.jar -x 2.rtf}} returns:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><div xmlns="http://www.w3.org/1999/xhtml">HOHcvanAHTI'Imoc
> v8 Hanemnan npfiBOBafi "DRAW
> </div>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <!-- tail omitted -->
> {noformat}
> Removing the image prevents this behavior ({{3.rtf}} doesn't contain an embedded image).
> Update: Tesseract must be installed to reproduce this issue.
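
The symptom described above (an element emitted ahead of the html root) could be checked mechanically. The following is a minimal sketch, not part of Tika or of the original report: the helper names first_element_name and has_stray_prefix are hypothetical, and the sample string is an abbreviated stand-in for the output quoted in the description. It only inspects the first start tag and does not validate the XHTML.

```python
import re
from typing import Optional


def first_element_name(xhtml_output: str) -> Optional[str]:
    """Return the name of the first start tag in the output, or None if there is none."""
    # Drop comments so a tag mentioned inside one is not mistaken for an element.
    without_comments = re.sub(r"<!--.*?-->", "", xhtml_output, flags=re.DOTALL)
    # "<?xml ...?>" is skipped automatically: the pattern requires a name
    # character right after "<", and "?" is not one.
    match = re.search(r"<([A-Za-z_][\w.-]*)", without_comments)
    return match.group(1) if match else None


def has_stray_prefix(xhtml_output: str) -> bool:
    """True when something other than <html> is the first element (the reported symptom)."""
    return first_element_name(xhtml_output) != "html"


# Abbreviated stand-in for the output quoted in the report (assumption, not real output).
broken = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<div xmlns="http://www.w3.org/1999/xhtml">ocr text</div>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>'
)
# Hypothetical example of the expected shape, with html as the first element.
expected = '<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head/></html>'

print(has_stray_prefix(broken))
print(has_stray_prefix(expected))
```

In practice the string would be the captured standard output of the tika-app command shown in the description. A check like this could serve as a quick regression guard, under the assumption that the -x output should always start with the html element.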


