Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2011/06/14 11:12:53 UTC

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

    [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049069#comment-13049069 ] 

Nick Burch commented on TIKA-675:
---------------------------------

Not all containers have names for their embedded resources, so any scheme that's adopted would need to account for that.

> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>
>                 Key: TIKA-675
>                 URL: https://issues.apache.org/jira/browse/TIKA-675
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Andrzej Bialecki 
>
> When parsing archive formats, the hierarchy of names is not tracked; only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. Similar to the VFS model, it would be nice to build pseudo-URLs for nested resources. For the Tika API, which uses streams, this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} or the parent-child relationship could be tracked in some other way. Some applications need this information, e.g. to determine which composite documents to delete from the index after a container archive has been deleted.
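
For illustration only, the pseudo-URL idea from the issue description could be sketched roughly as below. This is not Tika code and not a proposed patch: the class NestedResourcePath, its methods, and the "embedded-N" placeholder for unnamed entries are all hypothetical, and the per-level scheme prefixes (tar:gz:) from the example are left out to keep the sketch short. It assumes the extractor would push a name when it descends into an embedded resource and pop it when it returns.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical helper, not part of Tika: tracks the chain of names from the
// outermost container down to the resource currently being parsed.
final class NestedResourcePath {
    private final Deque<String> names = new ArrayDeque<>();
    private int unnamedCount = 0;

    // Call when descending into an embedded resource. Containers that do not
    // name their entries pass null or "", and a placeholder is generated.
    void push(String name) {
        if (name == null || name.isEmpty()) {
            name = "embedded-" + unnamedCount++; // assumed placeholder format
        }
        names.addLast(name);
    }

    // Call when the embedded resource has been fully processed.
    // Throws NoSuchElementException if nothing was pushed.
    void pop() {
        names.removeLast();
    }

    // Joins the names outermost-first with "!/", as in the issue's example.
    String toPseudoUrl() {
        StringBuilder sb = new StringBuilder("stream://");
        Iterator<String> it = names.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append("!/");
            }
        }
        return sb.toString();
    }
}
```

Pushing "example.tar.gz", "example.tar" and "example.html" in that order and then calling toPseudoUrl() gives stream://example.tar.gz!/example.tar!/example.html. A counter-based placeholder is only one possible answer to the unnamed-entry concern raised in the comment; an index relative to the parent container might be more stable across re-parses.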

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira