Posted to dev@tika.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2011/10/01 12:13:45 UTC

[jira] [Created] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

OpenOffice parser: embedded OLE docs are extracted at the end, as extra <html>...</html>
----------------------------------------------------------------------------------------

                 Key: TIKA-735
                 URL: https://issues.apache.org/jira/browse/TIKA-735
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor


When I have an OpenOffice presentation (ODP) that embeds (OLE)
objects, in this case OpenOffice text, text from the embedded objects
is at the end of the presentation.

It's great that we are extracting the embedded text, but it'd be
better if each embedded object's text were inlined on the slide that
embedded it.

I have a simple test ODP with two slides.  Each slide has its own
text, and then embeds a text OLE object with text as well, and this is
the output:

{noformat}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><div/>
<div><p>Main text on page 1</p>
</div>
<object/><div><div/>
</div>
<div/>
<div><ul>	<li><p>Main text on page 2</p>
</li>
</ul>
</div>
<object/><div><div/>
</div>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><p>Here is some embedded text on page 1</p>
</body></html><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="20970"/>
<meta name="Content-Type" content="application/vnd.oasis.opendocument.presentation"/>
<meta name="resourceName" content="embeddedText.odp"/>
<title/>
</head>
<body><p>Here is some embedded text on page 2</p>
</body></html>
{noformat}

You can see "Here is some embedded text on page N" comes out at the end,
after the main text "Main text on page N" for both slides.

It's also odd that we get a new html/head/meta/body for each embedded
doc (there should be only one for the overall document).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118861#comment-13118861 ] 

Jukka Zitting commented on TIKA-735:
------------------------------------

A parser should always produce valid XHTML output. If there's an embedded document that's fed into a recursive parse() call, the EmbeddedContentHandler and BodyContentHandler classes can (and should) be used to include only the extracted body content of the embedded document. See the ParsingEmbeddedDocumentExtractor class for how this is done. In fact I'd recommend simply using the ParsingEmbeddedDocumentExtractor class directly, just like the package, POIFS, and OOXML parsers already do.
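
The body-only idea described above could be sketched roughly as follows. This is a minimal illustration that uses only the JDK's SAX classes, not Tika itself: BodyOnlySketch, BodyOnlyHandler, and inlineEmbedded are hypothetical names, not Tika's EmbeddedContentHandler or BodyContentHandler, and the sketch drops element attributes for brevity. It assumes the embedded document arrives as a well-formed XHTML string.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyOnlySketch {

    // Hypothetical stand-in (not a Tika class): copies only what is inside <body>.
    static final class BodyOnlyHandler extends DefaultHandler {
        private final StringBuilder out;
        private int depth = 0; // 0 = outside <body>, 1 = directly inside it

        BodyOnlyHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                out.append('<').append(qName).append('>'); // attributes dropped for brevity
                depth++;
            } else if ("body".equals(qName)) {
                depth = 1; // enter the body without emitting the <body> tag itself
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 1) {
                out.append("</").append(qName).append('>');
            }
            if (depth > 0) {
                depth--; // at depth 1 this is </body>, which is swallowed
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                out.append(escape(new String(ch, start, length)));
            }
        }
    }

    // Re-escape text so the combined output stays well-formed.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one slide <div> with the embedded document's body content inlined.
    static String inlineEmbedded(String slideText, String embeddedXhtml) {
        StringBuilder out = new StringBuilder("<div><p>").append(escape(slideText)).append("</p>");
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(embeddedXhtml)), new BodyOnlyHandler(out));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("embedded XHTML could not be parsed", e);
        }
        return out.append("</div>").toString();
    }

    public static void main(String[] args) {
        String embedded = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
            + "<body><p>Here is some embedded text on page 1</p></body></html>";
        // prints <div><p>Main text on page 1</p><p>Here is some embedded text on page 1</p></div>
        System.out.println(inlineEmbedded("Main text on page 1", embedded));
    }
}
```

The html, head, and body wrappers of the embedded document are discarded and only its paragraphs reach the output, so the result stays a single well-formed fragment with the embedded text next to the slide that contains it.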

Anyway, as Nick mentioned elsewhere, the current code is probably not worth fixing, since it will likely be rewritten to use the ODF toolkit in any case.
                
[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118833#comment-13118833 ] 

Michael McCandless commented on TIKA-735:
-----------------------------------------

Ahhh, I see.

So it looks like our default behavior for embedded docs is to fully
extract them, concatenated to the end of the XHTML, as "full" XHTML
docs (ie new <html>...</html> each time).

But maybe we can change TikaCLI so that content from sub-docs is
optionally "inlined" instead.

I see TikaCLI already has the -z option, which extracts embedded
docs to separate fileN files in the current dir...

                
[jira] [Commented] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118782#comment-13118782 ] 

Nick Burch commented on TIKA-735:
---------------------------------

I think this is a Tika CLI issue, rather than a Parser one. It should all depend on how you configure the recursing parser you attach to the parse context.
                
[jira] [Updated] (TIKA-735) OpenOffice parser: embedded OLE docs are extracted at the end, as extra ...

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-735:
------------------------------------

    Attachment: embeddedText.odp

ODP document that leads to the above text output from TikaCLI -x.
                