Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2014/09/20 17:48:33 UTC

[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

    [ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142056#comment-14142056 ] 

Nick Burch commented on TIKA-1420:
----------------------------------

Are you envisioning something that will look for certain kinds of information in the primary parser's output (e.g. snippets of XHTML), which will then be used to build the metadata? Or would the information that goes into the metadata come from somewhere else?

> Add Metadata Extraction to Arbitrary Parsers
> --------------------------------------------
>
>                 Key: TIKA-1420
>                 URL: https://issues.apache.org/jira/browse/TIKA-1420
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tyler Palsulich
>            Priority: Minor
>
> Suppose you wish to extract information from arbitrary file types and add it to a Metadata Object. This type of task is best handled by a... Handler. But, Handlers do not have access to the Metadata Object passed to a Parser. 
> So, I see a few ways we could do this using existing functionality.
> 1) Make an intermediate XML representation of the desired metadata in a handler, then convert the XML to the Metadata after parsing. 
> 2) Create a second Parser which extracts the desired information.
>      a) Assume the Handler passed to this Parser is already filled with content. So, we could simply get whatever content we need from the Handler and populate the Metadata directly.
>      b) Create a new Stream in the first Parser to pass to the second, which in turn populates the Metadata.
> None of these options seem ideal. Is there a better way to handle this scenario? Or, can we create some sort of... wrapper for a Handler which can accept a Metadata Object to populate directly? 
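
For illustration only, the "wrapper for a Handler" idea in the description could look roughly like the sketch below. It is not Tika code. To stay self-contained it uses only the JDK's SAX classes, a plain Map<String, String> stands in for Tika's Metadata object, and the class name MetadataPopulatingHandler, the extract helper, and the choice of capturing an h1 element are all hypothetical. The wrapper forwards events to the caller's handler and, along the way, copies the text of one element into the map.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a handler decorator that forwards SAX events to a
// delegate and copies the text of one element into a metadata map.
// Map<String, String> is a stand-in for Tika's Metadata object.
class MetadataPopulatingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final Map<String, String> metadata;
    private final String elementName;
    private final String metadataKey;
    private StringBuilder buffer; // non-null only while inside the target element

    MetadataPopulatingHandler(ContentHandler delegate, Map<String, String> metadata,
                              String elementName, String metadataKey) {
        this.delegate = delegate;
        this.metadata = metadata;
        this.elementName = elementName;
        this.metadataKey = metadataKey;
    }

    // Only the three events used here are forwarded; a fuller decorator
    // would forward every ContentHandler callback to the delegate.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (buffer == null && elementName.equals(qName)) {
            buffer = new StringBuilder();
        }
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
        delegate.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (buffer != null && elementName.equals(qName)) {
            // Keep the first occurrence only; later matches are ignored.
            metadata.putIfAbsent(metadataKey, buffer.toString().trim());
            buffer = null;
        }
        delegate.endElement(uri, localName, qName);
    }

    // Convenience helper for the sketch: parse a string and return the map.
    static Map<String, String> extract(String xml, String elementName, String key)
            throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        DefaultHandler sink = new DefaultHandler(); // stands in for the caller's real handler
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new MetadataPopulatingHandler(sink, metadata, elementName, key));
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Quarterly Report</h1><p>Body text</p></body></html>";
        System.out.println(extract(xhtml, "h1", "heading"));
    }
}
```

In this sketch the metadata is filled in during the same pass that feeds the primary handler, so neither an intermediate XML representation nor a second parse would be needed. Whether that matches what the reporter has in mind depends on where the information is expected to come from.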


