Posted to dev@tika.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2014/04/23 08:23:15 UTC

[jira] [Commented] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

    [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977887#comment-13977887 ] 

Rupert Westenthaler commented on TIKA-1109:
-------------------------------------------

To work around the reported issues, I created a custom bundle for Tika 1.5 [1]. This bundle does not embed commons-compress, xz, commons-codec, or commons-io, because other Apache Stanbol modules require them anyway, so they are guaranteed to be available in the OSGi environment. I am not sure whether Tika would prefer to embed them to avoid dependencies on other bundles.
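
For illustration only, a maven-bundle-plugin configuration along the following lines would leave those four artifacts out of the embedded set. This is a rough sketch and not a copy of the pom at [1]; the scope filter and the Embed-Transitive setting are assumptions.

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- Embed compile/runtime dependencies as jars, except the four
           artifacts assumed to be provided by other bundles. -->
      <Embed-Dependency>*;scope=compile|runtime;inline=false;artifactId=!commons-compress|xz|commons-codec|commons-io</Embed-Dependency>
      <!-- Assumption: transitive dependencies are embedded as well. -->
      <Embed-Transitive>true</Embed-Transitive>
    </instructions>
  </configuration>
</plugin>
```

With such a setup, the packages of the excluded artifacts have to be resolved through Import-Package from other bundles in the OSGi environment.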

If you would like me to create a patch for 1.5 or 1.6-SNAPSHOT, just leave a short comment.


[1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/commons/tikabundle/pom.xml

> Metadata not extracted before the content in OOXML (pptx)
> ---------------------------------------------------------
>
>                 Key: TIKA-1109
>                 URL: https://issues.apache.org/jira/browse/TIKA-1109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>              Labels: patch
>             Fix For: 1.5
>
>         Attachments: TIKA-1109.patch
>
>
> It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).



--
This message was sent by Atlassian JIRA
(v6.2#6252)