Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/06/25 16:56:22 UTC

[jira] [Commented] (TIKA-1109) Metadata not extracted before the context in OOXML (pptx)

    [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693097#comment-13693097 ] 

Nick Burch commented on TIKA-1109:
----------------------------------

Some parsers fetch the metadata first, some do it after the text, some populate the metadata as they make their way through the file, and some do a mixture! The current parser contract is only that the metadata will be populated by the end of the call to parse, not that it will be available during the parsing. It's up to the person writing the parser to do what makes most sense for their format.

If you need all the metadata before you process the text, you'll need to buffer the SAX events yourself, sorry.
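
A minimal sketch of what buffering the SAX events could look like, using only the JDK's org.xml.sax classes. BufferingHandler and replayTo are hypothetical names made up for illustration, not part of Tika, and the sketch only covers the ContentHandler events shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: records SAX events so they can be
// replayed to the real handler once parsing has finished.
class BufferingHandler extends DefaultHandler {

    // One recorded event that can be re-fired against any ContentHandler.
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add(t -> t.startDocument());
    }

    @Override
    public void endDocument() {
        events.add(t -> t.endDocument());
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add(t -> t.startPrefixMapping(prefix, uri));
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add(t -> t.endPrefixMapping(prefix));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: the caller may reuse the object after this call.
        final Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the text: the caller's char[] is only valid during this call.
        final char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        final char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.ignorableWhitespace(copy, 0, copy.length));
    }

    // Re-fires every recorded event, in order, against the real handler.
    public void replayTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
    }
}
```

With Tika, the idea would be to pass a BufferingHandler as the ContentHandler to parser.parse(stream, handler, metadata, context), read whatever is needed from the Metadata object once parse has returned, and then call replayTo on the handler that does the real text processing. The whole event stream is held in memory, so this costs more for large documents.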
                
> Metadata not extracted before the context in OOXML (pptx)
> ---------------------------------------------------------
>
>                 Key: TIKA-1109
>                 URL: https://issues.apache.org/jira/browse/TIKA-1109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).
