Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/03/15 20:37:33 UTC

[jira] [Comment Edited] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)

    [ https://issues.apache.org/jira/browse/TIKA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196045#comment-15196045 ] 

Tim Allison edited comment on TIKA-1903 at 3/15/16 7:37 PM:
------------------------------------------------------------

Copied from [~rgauss] on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding correctly, that alone would still force users to write 'Tika-based' XMP parsers rather than giving them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes a Metadata object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


import java.util.Map;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;

/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map<Metadata, byte[]> getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to include some EmbeddedResources object to be optionally populated along with the Metadata in the Parser.parse method?
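
To make Option 2 a little more concrete, here is a rough sketch of what a standalone storing implementation might look like. It is only an illustration, not a proposed patch: InMemoryStoringExtractor is a hypothetical name, and the nested Metadata and EmbeddedDocumentExtractor types are minimal stand-ins (assumed to mirror the shape of the Tika types) so the example compiles without Tika on the classpath. In real code they would be org.apache.tika.metadata.Metadata and org.apache.tika.extractor.EmbeddedDocumentExtractor.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

final class StoringExtractorSketch {

    /** Stand-in for org.apache.tika.metadata.Metadata (assumption, heavily simplified). */
    public static class Metadata {
        private final Map<String, String> values = new LinkedHashMap<>();
        public void set(String name, String value) { values.put(name, value); }
        public String get(String name) { return values.get(name); }
    }

    /** Stand-in for the Tika interface of the same name (assumption). */
    public interface EmbeddedDocumentExtractor {
        boolean shouldParseEmbedded(Metadata metadata);

        void parseEmbedded(InputStream stream, ContentHandler handler,
                           Metadata metadata, boolean outputHtml)
                throws SAXException, IOException;
    }

    /** The interface proposed in Option 1. */
    public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {
        Map<Metadata, byte[]> getEmbeddedResources();
    }

    /** Hypothetical Option 2 implementation: buffers every embedded stream in memory. */
    public static class InMemoryStoringExtractor implements StoringEmbeddedDocumentExtractor {
        // The stand-in Metadata does not override equals/hashCode,
        // so keys in this map are compared by identity.
        private final Map<Metadata, byte[]> resources = new LinkedHashMap<>();

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // store everything; a real implementation might filter by type
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml) throws IOException {
            // Copy the raw bytes instead of parsing them; the handler is ignored.
            // The stream is left open because the caller owns it.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            resources.put(metadata, buffer.toByteArray());
        }

        @Override
        public Map<Metadata, byte[]> getEmbeddedResources() {
            // Matches the javadoc above: null when nothing was stored.
            return resources.isEmpty() ? null : resources;
        }
    }
}
```

Two design points this surfaces: buffering whole resources in memory could get expensive for large attachments (which is part of the appeal of the temp-file approach in Option 3), and keying the map on a mutable metadata object is fragile if that object is changed after it has been stored.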

Other options? Maybe they don't need the RAW XMP?




> Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1903
>                 URL: https://issues.apache.org/jira/browse/TIKA-1903
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> On TIKA-1607, we veered a bit from allowing flexible metadata structures to how to handle embedded metadata documents, such as XMP.  Let's use this issue to discuss and design.


