Posted to dev@tika.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2011/06/13 19:12:51 UTC
[jira] [Created] (TIKA-675) PackageExtractor should track names of
recursively nested resources
PackageExtractor should track names of recursively nested resources
-------------------------------------------------------------------
Key: TIKA-675
URL: https://issues.apache.org/jira/browse/TIKA-675
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.0
Reporter: Andrzej Bialecki
When parsing archive formats, the hierarchy of names is not tracked; only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. It would be nice to build pseudo-URLs for nested resources, in a way similar to the VFS model. In the case of the Tika API, which uses streams, this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or we could otherwise track the parent-child relationship. For example, some applications need this information to determine which composite documents to delete from the index after a container archive has been deleted.
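
To make the proposed naming concrete, here is a minimal sketch in plain Java of how such a pseudo-URL could be assembled and how the parent-child relationship could be checked for the index-cleanup case. It does not use any Tika classes. The class name NestedResourceUrl, both method names, and the choice to compare only the part after "://" are assumptions made for illustration, not part of the proposal.

```java
import java.util.List;

// Hypothetical helper, not part of Tika: builds VFS-style pseudo-URLs
// such as tar:gz:stream://example.tar.gz!/example.tar!/example.html
final class NestedResourceUrl {
    // Separator between nesting levels, as in the example in the issue.
    private static final String SEPARATOR = "!/";

    // formats: container formats, outermost format last (e.g. "tar", "gz").
    // names: resource names from the outermost container to the innermost resource.
    static String build(List<String> formats, List<String> names) {
        StringBuilder url = new StringBuilder();
        for (String format : formats) {
            url.append(format).append(':');
        }
        url.append("stream://").append(String.join(SEPARATOR, names));
        return url.toString();
    }

    // Returns the part after "://", or the whole string if there is no scheme.
    private static String path(String url) {
        int index = url.indexOf("://");
        return index < 0 ? url : url.substring(index + 3);
    }

    // True if resourceUrl names something nested (at any depth) inside containerUrl.
    // Scheme prefixes are ignored because they differ between nesting levels.
    static boolean isNestedIn(String resourceUrl, String containerUrl) {
        return path(resourceUrl).startsWith(path(containerUrl) + SEPARATOR);
    }
}
```

An application that removes a container from its index could then delete every indexed document for which isNestedIn(documentUrl, containerUrl) is true.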
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-675) PackageExtractor should track names
of recursively nested resources
Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192200#comment-13192200 ]
Nick Burch commented on TIKA-675:
---------------------------------
We could probably do this with a wrapper parser, which tracks the name, outputs the nested name to the metadata, then delegates to a different parser for the actual processing.
If we added this, we'd need to decide which metadata key to put this in (a new one, or change the resource name?), and how to separate the parts (maybe a ! like in VFS?).
It should be very quick to do though, once those are decided.
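
As a rough sketch of the name-tracking part only, the wrapper could keep a stack of names: push on entering an embedded resource, pop on leaving it. The example below is plain Java and deliberately leaves out Tika's parser and metadata classes; NestedNameTracker, its method names, and the "!/" separator are assumptions for illustration, since the key and separator are still undecided.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Tika: tracks the chain of resource
// names while a wrapper parser descends into nested containers.
final class NestedNameTracker {
    // Assumed separator; the thread leaves the final choice open.
    private static final String SEPARATOR = "!/";

    // Names from the outermost container (first) to the current resource (last).
    private final Deque<String> names = new ArrayDeque<>();

    // Call before delegating to the real parser; returns the full nested name.
    String enter(String name) {
        names.addLast(name);
        return current();
    }

    // Call after the delegate has finished with the current resource.
    void leave() {
        names.removeLast();
    }

    // Full nested name of the resource currently being processed.
    String current() {
        return String.join(SEPARATOR, names);
    }
}
```

In a wrapper parser, the value returned by enter() would be written to whichever metadata key is chosen, and leave() would go in a finally block around the delegated parse call so the stack stays consistent if parsing fails.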
> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>
> Key: TIKA-675
> URL: https://issues.apache.org/jira/browse/TIKA-675
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.10
> Reporter: Andrzej Bialecki
>
> When parsing archive formats the hierarchy of names is not tracked, only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. In a way similar to the VFS model it would be nice to build pseudo-urls for nested resources. In case of Tika API that uses streams this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} ...or otherwise track the parent-child relationship - e.g. some applications need this information to indicate what composite documents to delete from the index after a container archive has been deleted.
[jira] [Commented] (TIKA-675) PackageExtractor should track names
of recursively nested resources
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049069#comment-13049069 ]
Nick Burch commented on TIKA-675:
---------------------------------
Not all containers have names for their embedded resources, so we'd need to account for that in any scheme that's adopted.
[jira] [Commented] (TIKA-675) PackageExtractor should track names
of recursively nested resources
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049159#comment-13049159 ]
Andrzej Bialecki commented on TIKA-675:
----------------------------------------
Good point. For example, Aperture assigns sequential IDs to resources that don't have names (e.g. parts in a MIME message).
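
A minimal sketch of that fallback in plain Java: one namer per container hands out a sequential placeholder whenever an embedded resource has no name. The class name EmbeddedResourceNamer and the "part-N" format are assumptions for illustration; they are not taken from Aperture or Tika.

```java
// Hypothetical helper, not part of Tika or Aperture: supplies a name for
// each embedded resource of one container, numbering the unnamed ones.
final class EmbeddedResourceNamer {
    // Number of unnamed resources seen so far in this container.
    private int unnamedCount = 0;

    // Returns the declared name if present, otherwise "part-1", "part-2", ...
    String nameFor(String declaredName) {
        if (declaredName != null && !declaredName.trim().isEmpty()) {
            return declaredName;
        }
        unnamedCount++;
        return "part-" + unnamedCount;
    }
}
```

A fresh namer would be created for each container, so the sequence restarts at every nesting level and the generated names stay stable as long as the container's parts are visited in the same order.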