Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2014/06/24 18:08:24 UTC

[jira] [Resolved] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

     [ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1353.
------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6

I've fixed those TODOs in r1605124. Now, if a TikaInputStream is given, the ODF file is processed with random access, so the metadata is handled first. If it's just a regular stream, the previous "iterate in turn" behaviour continues.
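
To illustrate the difference between the two code paths, here is a minimal sketch that uses only java.util.zip. It is not the code committed to Tika. The class OdfOrderSketch and all of its methods are hypothetical names made up for this example, and instead of parsing anything it only records the order in which meta.xml and content.xml would be handed to a parser.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; none of these names exist in Tika.
class OdfOrderSketch {

    /** Random access: entries are looked up by name, so meta.xml can always go first. */
    static List<String> visitRandomAccess(File odf) {
        List<String> visited = new ArrayList<>();
        try (ZipFile zip = new ZipFile(odf)) {
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                ZipEntry entry = zip.getEntry(name);
                if (entry == null) {
                    continue; // entry not present in this archive
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    visited.add(name); // a real parser would consume 'in' here
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return visited;
    }

    /** Streaming: entries can only be handled in the order they were stored. */
    static List<String> visitStreaming(InputStream odf) {
        List<String> visited = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                String name = entry.getName();
                if (name.equals("meta.xml") || name.equals("content.xml")) {
                    visited.add(name); // a real parser would read from 'zip' here
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return visited;
    }

    /** Helper for trying the sketch: builds an in-memory zip with the given entry order. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Helper for trying the sketch: writes the bytes to a temporary file. */
    static File toTempFile(byte[] data) {
        try {
            File file = File.createTempFile("odf-sketch", ".odt");
            file.deleteOnExit();
            Files.write(file.toPath(), data);
            return file;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

For an archive built with zipOf("mimetype", "content.xml", "meta.xml"), the random access method visits meta.xml before content.xml, while the streaming method sees content.xml first. That ordering is what the reporter ran into.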

> OpenDocumentParser doesn't correctly process metadata
> -----------------------------------------------------
>
>                 Key: TIKA-1353
>                 URL: https://issues.apache.org/jira/browse/TIKA-1353
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.5
>            Reporter: Steve R
>             Fix For: 1.6
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When using OpenDocumentParser, the metadata isn't set correctly. When using it to write an HTML file, the only metadata that it knows about is the content type, because that is set ahead of time.
> The problem is that when iterating over the zip contents, meta.xml isn't processed before content.xml. The metadata set on the parse object is correct after parse() returns, but the resulting HTML file is missing all of the metadata.
> Changing the code to be 
> boolean parsedMetaData = false;
> boolean delayLoadContent = false;
> while (entry != null) {
> ...
> } else if (entry.getName().equals("meta.xml")) {
>                 meta.parse(zip, new DefaultHandler(), metadata, context);
>                 parsedMetaData = true;
>                 if (delayLoadContent) {
>                     if (content instanceof OpenDocumentContentParser) {
>                         ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                     } else {
>                         // Foreign content parser was set:
>                         content.parse(zip, handler, metadata, context);
>                     }
>                 }
>             } else if (entry.getName().endsWith("content.xml")) {
>                 if (!parsedMetaData) {
>                     delayLoadContent = true;
>                 } else {
>                     if (content instanceof OpenDocumentContentParser) {
>                         ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
>                     } else {
>                         // Foreign content parser was set:
>                         content.parse(zip, handler, metadata, context);
>                     }
>                 }
>             }
> works as expected.


