You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2011/10/01 14:45:34 UTC

[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

    [ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118782#comment-13118782 ] 

Nick Burch commented on TIKA-735:
---------------------------------

I think this is a Tika CLI issue, rather than a Parser one. It should all depend on how you configure the recursing parser you attach to the parse context
                
> OpenOffice parser: embedded OLE docs are extracted at the end, as extra <html>...</html>
> ----------------------------------------------------------------------------------------
>
>                 Key: TIKA-735
>                 URL: https://issues.apache.org/jira/browse/TIKA-735
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: embeddedText.odp
>
>
> When I have an OpenOffice presentation (ODP) that embeds (OLE)
> objects, in this case OpenOffice text, text from the embedded objects
> is at the end of the presentation.
> It's great that we are extracting the embedded text, but it'd be
> better if each embedded object's text were inlined on the slide that
> embedded it.
> I have a simple test ODP with two slides.  Each slide has its own
> text, and then embeds a text OLE object with text as well, and this is
> the output:
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><div/>
> <div><p>Main text on page 1</p>
> </div>
> <object/><div><div/>
> </div>
> <div/>
> <div><ul>	<li><p>Main text on page 2</p>
> </li>
> </ul>
> </div>
> <object/><div><div/>
> </div>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 1</p>
> </body></html><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="20970"/>
> <meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
> <meta name="resourceName" content="embeddedText.odp"/>
> <title/>
> </head>
> <body><p>Here is some embedded text on page 2</p>
> </body></html>
> {noformat}
> You can see "Here is some embedded text on page N" comes out at the end,
> after the main text "Main text on page N" for both slides.
> It's also odd that we get a new html/head/meta/body for each embedded
> doc (there should be only one for the overall document).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira