Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/06/04 16:30:02 UTC

[jira] [Commented] (TIKA-1212) Recursive Extraction of Archive File

    [ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017723#comment-14017723 ] 

Tim Allison commented on TIKA-1212:
-----------------------------------

The only solution that I could find was to use a tracker class in ParseContext, but that required a new ParseContext for each embedded parse. That would be OK, except that it means throwing out whatever the user added to ParseContext, because we don't currently have a way of cloning ParseContext.  I'm sure there are other ways to do this, and I'm not happy with my solution...
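
For illustration only, here is a minimal sketch of the path bookkeeping such a tracker could do. EmbeddedPathTracker and its methods are hypothetical names, not part of Tika, and the sketch assumes something calls enter() before and exit() after each embedded parse. It does not address the ParseContext cloning problem; it only shows how the full path for the reporter's example (abc.zip/pqr.zip/m.ppt) could be assembled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not a Tika class): remembers the chain of
 * container names leading to the resource currently being parsed.
 */
final class EmbeddedPathTracker {
    // Outermost container first, innermost resource last.
    private final Deque<String> names = new ArrayDeque<>();

    EmbeddedPathTracker(String rootName) {
        names.addLast(rootName);
    }

    /** Assumed to be called just before an embedded resource is parsed. */
    void enter(String resourceName) {
        names.addLast(resourceName);
    }

    /** Assumed to be called once the embedded resource has been parsed. */
    void exit() {
        if (names.size() <= 1) {
            throw new IllegalStateException("exit() without matching enter()");
        }
        names.removeLast();
    }

    /** Full path from the root container down to the current resource. */
    String currentPath() {
        return String.join("/", names);
    }

    public static void main(String[] args) {
        EmbeddedPathTracker tracker = new EmbeddedPathTracker("abc.zip");
        tracker.enter("pqr.zip");
        tracker.enter("m.ppt");
        System.out.println(tracker.currentPath());
    }
}
```

In Tika itself an object like this would presumably be registered with ParseContext.set(...) and looked up by the embedded document extractor, which is exactly where the per-parse ParseContext issue described above comes in.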

> Recursive Extraction of Archive File
> ------------------------------------
>
>                 Key: TIKA-1212
>                 URL: https://issues.apache.org/jira/browse/TIKA-1212
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Vikram
>            Priority: Critical
>         Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, abc.zip, abc.zip, test_recursive_embedded.docx
>
>
> Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
> Requirement:
> -----------------
> abc.zip
>    ---> a.doc
>    ---> b.xls
>    ---> pqr.zip
>   -------------> m.ppt
> There are two issues with TIKA:
> 1. How can separate extraction of embedded documents be optionally blocked?
> 2. When I extract recursively, the file name or resourceKeyName does not come out properly. For example:
>     --> a.doc should have the value abc.zip/a.doc. Similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt.
>     --> Even for the embedded doc, only a random name is returned, without the proper file path.



--
This message was sent by Atlassian JIRA
(v6.2#6252)