Posted to dev@tika.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2011/06/13 19:12:51 UTC

[jira] [Created] (TIKA-675) PackageExtractor should track names of recursively nested resources

PackageExtractor should track names of recursively nested resources
-------------------------------------------------------------------

                 Key: TIKA-675
                 URL: https://issues.apache.org/jira/browse/TIKA-675
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Andrzej Bialecki 


When parsing archive formats, the hierarchy of names is not tracked; only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. It would be nice to build pseudo-URLs for nested resources, in a way similar to the VFS model. For the Tika API, which uses streams, this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} Alternatively, Tika could track the parent-child relationship in some other way. Some applications need this information, for example to determine which composite documents to delete from the index after a container archive has been deleted.
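
As a rough illustration of the pseudo-URL idea, the sketch below joins a chain of names with a "!/" separator and uses a prefix test to find everything nested inside a deleted container. It is only a sketch under assumptions: the class and method names (NestedResourcePath, join, isInside) are hypothetical and not part of Tika, and the scheme prefix (tar:gz:stream://) is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Tika class: builds VFS-style nested names.
final class NestedResourcePath {
    // Separator between a container and the entry inside it (assumed, as in VFS).
    static final String SEPARATOR = "!/";

    // Joins names from the outermost container to the innermost entry.
    static String join(List<String> names) {
        return String.join(SEPARATOR, names);
    }

    // True if 'path' names something nested (at any depth) inside 'container'.
    static boolean isInside(String path, String container) {
        return path.startsWith(container + SEPARATOR);
    }

    public static void main(String[] args) {
        String path = join(Arrays.asList("example.tar.gz", "example.tar", "example.html"));
        System.out.println(path);
        System.out.println(isInside(path, "example.tar.gz"));
    }
}
```

An indexing application that stored such a path for every document could, when example.tar.gz is deleted, remove every document for which isInside(path, "example.tar.gz") holds.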

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192200#comment-13192200 ] 

Nick Burch commented on TIKA-675:
---------------------------------

We could probably do this with a wrapper parser that tracks the name, writes the nested name to the metadata, and then delegates to a different parser for the actual processing.

If we added this, we'd need to decide which metadata key to put it in (a new one, or change the resource name?), and how to separate the parts (maybe a ! as in VFS?).

It should be very quick to do, though, once those are decided.
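
To make the wrapper idea a little more concrete, here is a minimal sketch of the name-tracking part only. It deliberately does not use Tika's Parser interface: NestedNameTracker and its methods are hypothetical, the delegate is modelled as a plain Runnable, and the "!/" separator is just one of the options mentioned above. A real wrapper would write currentPath() into whichever metadata key is chosen before delegating.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keeps the chain of names while parsing recurses.
final class NestedNameTracker {
    // Outermost container first, innermost resource last.
    private final Deque<String> names = new ArrayDeque<>();

    // The nested name of the resource currently being processed.
    String currentPath() {
        return String.join("!/", names);
    }

    // Records 'name', runs the delegate, then restores the previous state
    // even if the delegate throws.
    void within(String name, Runnable delegate) {
        names.addLast(name);
        try {
            delegate.run();
        } finally {
            names.removeLast();
        }
    }

    public static void main(String[] args) {
        NestedNameTracker tracker = new NestedNameTracker();
        tracker.within("example.tar.gz", () ->
            tracker.within("example.tar", () ->
                tracker.within("example.html", () ->
                    System.out.println(tracker.currentPath()))));
    }
}
```

The try/finally keeps the tracked chain consistent when parsing one embedded resource fails and the container parser moves on to the next entry.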
                
> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>
>                 Key: TIKA-675
>                 URL: https://issues.apache.org/jira/browse/TIKA-675
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki 
>
> When parsing archive formats, the hierarchy of names is not tracked; only the current embedded component's name is preserved under Metadata.RESOURCE_NAME_KEY. It would be nice to build pseudo-URLs for nested resources, in a way similar to the VFS model. For the Tika API, which uses streams, this could look like {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} Alternatively, Tika could track the parent-child relationship in some other way. Some applications need this information, for example to determine which composite documents to delete from the index after a container archive has been deleted.


        

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049069#comment-13049069 ] 

Nick Burch commented on TIKA-675:
---------------------------------

Not all containers have names for their embedded resources, so we'd need to account for that in any scheme that's adopted.



        

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049159#comment-13049159 ] 

Andrzej Bialecki  commented on TIKA-675:
----------------------------------------

Good point. For example, Aperture assigns sequential IDs to resources that don't have names (e.g. parts in a MIME message).
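
A fallback along those lines might look like the sketch below. This is not Aperture's or Tika's actual code: EmbeddedNamer, nameFor and the "part-" prefix are made up for illustration, and it assumes one namer instance is created per container so that the counter restarts for each one.

```java
// Hypothetical sketch: supplies a stand-in name for unnamed embedded resources.
final class EmbeddedNamer {
    // Number of unnamed resources seen so far in this container.
    private int unnamedCount = 0;

    // Returns the resource's own name if it has one, else a sequential ID.
    String nameFor(String resourceName) {
        if (resourceName != null && !resourceName.isEmpty()) {
            return resourceName;
        }
        return "part-" + (unnamedCount++);
    }

    public static void main(String[] args) {
        EmbeddedNamer namer = new EmbeddedNamer();
        System.out.println(namer.nameFor(null));
        System.out.println(namer.nameFor("report.pdf"));
        System.out.println(namer.nameFor(""));
    }
}
```

Sequential IDs are only stable if the container always yields its parts in the same order, which is worth keeping in mind if the names are used to find documents to delete later.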

