Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/06/04 16:28:01 UTC

[jira] [Comment Edited] (TIKA-1212) Recursive Extraction of Archive File

    [ https://issues.apache.org/jira/browse/TIKA-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017721#comment-14017721 ] 

Tim Allison edited comment on TIKA-1212 at 6/4/14 2:26 PM:
-----------------------------------------------------------

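[~gagravarr], I'm not sure that the example code works with the attached file.  The issue is that the code keeps appending each resource name to location whether or not there is new depth, so siblings are reported as if they were nested (e.g. embed1a.txt/embed1b.txt).

The structure of the attached file is: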
{noformat}
test_recursive.docx
   embed1.zip
       embed1a.txt
       embed1b.txt
       embed2.zip
          embed2a.txt
          embed2b.txt
          embed3.zip
              embed3.txt
              embed4.zip
                   embed4.txt
   
{noformat}

The output I get is:
{noformat}
----
Resource is test_recursive.docx/embedded-1/image1.emf
----
embeddedRelationshipId=rId7 Content-Type=application/x-emf resourceName=image1.emf 
----

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed1a.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed1a.txt 
----
embed_1a

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed1b.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed1b.txt 
----
embed_1b

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed2a.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed2a.txt 
----
embed_2a

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed2b.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed2b.txt 
----
embed_2b

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed3.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed3.txt 
----
embed_3

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip/embed4.txt
----
Content-Encoding=ISO-8859-1 embeddedRelationshipId=embed4.txt Content-Type=text/plain; charset=ISO-8859-1 resourceName=embed4.txt 
----
embed_4

----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip/embed3.txt/embed4.zip
----
embeddedRelationshipId=embed4.zip Content-Type=application/zip resourceName=embed4.zip 
----

embed4.txt


----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip/embed2a.txt/embed2b.txt/embed3.zip
----
embeddedRelationshipId=embed3.zip Content-Type=application/zip resourceName=embed3.zip 
----

embed3.txt


embed4.zip


----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip/embed1a.txt/embed1b.txt/embed2.zip
----
embeddedRelationshipId=embed2.zip Content-Type=application/zip resourceName=embed2.zip 
----

embed2a.txt


embed2b.txt


embed3.zip


----
Resource is test_recursive.docx/embedded-1/image1.emf/embed1.zip
----
embeddedRelationshipId=rId8 Content-Type=application/zip resourceName=embed1.zip 
----

embed1a.txt


embed1b.txt


embed2.zip


----
Resource is test_recursive.docx/embedded-1
----
cp:revision=1 meta:save-date=2014-06-04T14:19:00Z Application-Name=Microsoft Office Word dcterms:created=2014-06-04T14:19:00Z Application-Version=14.0000 Character-Count-With-Spaces=30 date=2014-06-04T14:19:00Z extended-properties:Template=Normal.dotm meta:line-count=1 publisher= Word-Count=4 meta:paragraph-count=1 Creation-Date=2014-06-04T14:19:00Z extended-properties:AppVersion=14.0000 Line-Count=1 extended-properties:Application=Microsoft Office Word Paragraph-Count=1 Last-Save-Date=2014-06-04T14:19:00Z Revision-Number=1 dcterms:modified=2014-06-04T14:19:00Z meta:creation-date=2014-06-04T14:19:00Z Template=Normal.dotm Page-Count=1 meta:character-count=27 Last-Modified=2014-06-04T14:19:00Z extended-properties:Company= meta:word-count=4 modified=2014-06-04T14:19:00Z xmpTPg:NPages=1 dc:publisher= Character Count=27 meta:page-count=1 meta:character-count-with-spaces=30 Content-Type=application/vnd.openxmlformats-officedocument.wordprocessingml.document 
----



embed_0  

{noformat}
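
For comparison, here is a minimal sketch of the path bookkeeping I would expect: add a segment when descending into a resource and drop it again when that resource has been fully handled. This is only an illustration under assumptions. The names EmbeddedPathSketch, Node, collectPaths and sampleTree are hypothetical, no Tika classes are involved, and the tree is the structure listed above (without the image1.emf entry).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: none of these names are Tika classes.
final class EmbeddedPathSketch {

    /** Stand-in for one embedded resource and the resources nested inside it. */
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns the full path of every resource, in the order it is entered. */
    static List<String> collectPaths(Node root) {
        List<String> paths = new ArrayList<String>();
        walk(root, new ArrayDeque<String>(), paths);
        return paths;
    }

    private static void walk(Node node, Deque<String> location, List<String> paths) {
        location.addLast(node.name);            // descend: location grows by one segment
        paths.add(String.join("/", location));
        for (Node child : node.children) {
            walk(child, location, paths);
        }
        location.removeLast();                  // ascend: restore the parent's location
    }

    /** The nesting listed above for the attached document. */
    static Node sampleTree() {
        return new Node("test_recursive.docx",
            new Node("embed1.zip",
                new Node("embed1a.txt"),
                new Node("embed1b.txt"),
                new Node("embed2.zip",
                    new Node("embed2a.txt"),
                    new Node("embed2b.txt"),
                    new Node("embed3.zip",
                        new Node("embed3.txt"),
                        new Node("embed4.zip",
                            new Node("embed4.txt"))))));
    }

    public static void main(String[] args) {
        for (String path : collectPaths(sampleTree())) {
            System.out.println(path);
        }
    }
}
```

With the segment removed on the way back up, embed1b.txt is reported under embed1.zip rather than under embed1a.txt. Something equivalent (restoring location after each embedded resource is parsed) may be all the example code needs.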


> Recursive Extraction of Archive File
> ------------------------------------
>
>                 Key: TIKA-1212
>                 URL: https://issues.apache.org/jira/browse/TIKA-1212
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Vikram
>            Priority: Critical
>         Attachments: RecursiveMetadataParserZukka.java, TIKA-Output.xlsx, abc.zip, abc.zip, test_recursive_embedded.docx
>
>
> Please refer to the code: http://wiki.apache.org/tika/RecursiveMetadata#Main_from_Jukka.27s_Example
> Requirement:
> -----------------
> abc.zip
>    ---> a.doc
>    ---> b.xls
>    ---> pqr.zip
>   -------------> m.ppt
> There are two issues with TIKA:
> 1. How can separate extraction of embedded documents be optionally blocked?
> 2. When I extract recursively, the file name (resourceKeyName) is not reported properly. For example:
>     --> a.doc should have the value abc.zip/a.doc, and similarly for b.xls. This is fine, BUT m.ppt has the resource file name pqr/m.ppt, which is WRONG. It should have the value abc.zip/pqr.zip/m.ppt.
>     --> Even for an embedded doc, only a random name is reported, without the proper file path.


