Posted to dev@tika.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2013/06/27 14:26:21 UTC

[jira] [Commented] (TIKA-1109) Metadata not extracted before the content in OOXML (pptx)

    [ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659 ] 

Daniel Bonniot de Ruisselet commented on TIKA-1109:
---------------------------------------------------

I tried it. It broke two tests (same cause): as you mentioned, for Excel the metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I changed how that is implemented, and the build now passes:

{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent ................................ SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .................................. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ............................... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ................................... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ........................... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ........................... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server ................................ SUCCESS [5.312s]}}
{{[INFO] Apache Tika ....................................... SUCCESS [0.014s]}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] BUILD SUCCESS}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO] ------------------------------------------------------------------------}}

{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java       |   11 -}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java |   36 ++----}}
{{ test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java             |   56 ++++++++++}}
{{ 3 files changed, 74 insertions(+), 29 deletions(-)}}

{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}


The logic in OOXMLExtractorFactory is now simpler, since I could remove the extra shielding, which I suppose was made necessary by the previous ordering.

And the metadata for OOXML formats is now available at parse time, as tested by the added test to OOXMLParserTest :)
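
To illustrate what "available at parse time" means, here is a minimal self-contained sketch. It is not the test from the patch and it does not use Tika: the class name, the parseMetadataFirst method and the plain Map standing in for the metadata object are all hypothetical, and only the JDK's SAX classes are real. The point is just the ordering: the metadata is filled in before the first content event reaches the handler, so the handler can read it while processing text.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataOrderSketch {

    // Hypothetical stand-in for a parser (not Tika code): it fills the
    // metadata map first and only then streams the text to the handler.
    static void parseMetadataFirst(Map<String, String> metadata, ContentHandler handler)
            throws SAXException {
        metadata.put("title", "Attachment Test"); // metadata populated up front
        char[] text = "slide text".toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }

    // Returns the title the handler could see when the first text arrived.
    static String titleSeenDuringParse() throws SAXException {
        final Map<String, String> metadata = new HashMap<>();
        final String[] seen = new String[1];
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                if (seen[0] == null) {
                    // Read the metadata while the content is still being processed.
                    seen[0] = metadata.get("title");
                }
            }
        };
        parseMetadataFirst(metadata, handler);
        return seen[0];
    }

    public static void main(String[] args) throws SAXException {
        System.out.println("title seen while parsing: " + titleSeenDuringParse());
    }
}
```

With the old ordering (text first, metadata afterwards) the same handler would see null for the title, which is the symptom described in the issue.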

                
> Metadata not extracted before the content in OOXML (pptx)
> ---------------------------------------------------------
>
>                 Key: TIKA-1109
>                 URL: https://issues.apache.org/jira/browse/TIKA-1109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read after the text. This means it's impossible to use the metadata while processing the text. I think it would be more useful to have the metadata populated first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment Test</dc:title>).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira