Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/02/05 14:51:00 UTC

[jira] [Assigned] (TIKA-2564) Tika client cannot extract files from embedded archive formats

     [ https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison reassigned TIKA-2564:
---------------------------------

    Assignee: Tim Allison

> Tika client cannot extract files from embedded archive formats
> --------------------------------------------------------------
>
>                 Key: TIKA-2564
>                 URL: https://issues.apache.org/jira/browse/TIKA-2564
>             Project: Tika
>          Issue Type: Bug
>         Environment: Mac OS 10.13.3 (17D47)
>  
> 17:42 ext$ java -version
> java version "9.0.1"
> Java(TM) SE Runtime Environment (build 9.0.1+11)
> Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)
> 17:42 ext$ uname -a
> Darwin bix.local 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64
>  
>  
>            Reporter: Marc Prud'hommeaux
>            Assignee: Tim Allison
>            Priority: Major
>
>  
> This may be related to TIKA-2395. When I try to extract the files from
> tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz with:
>  
> % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz
> I see the following exception:
>  
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at coursier.cli.qR.a(Unknown Source)
> at coursier.cli.qQ.j(Unknown Source)
> at coursier.cli.qW.a(Unknown Source)
> at d.h.a.c(Unknown Source)
> at b.b.c_(Unknown Source)
> at d.b.d.E.g(Unknown Source)
> at d.b.e.aW.g(Unknown Source)
> at d.b.f.b.aa.a(Unknown Source)
> at coursier.cli.qQ.b(Unknown Source)
> at coursier.cli.Q.b(Unknown Source)
> at b.J.c_(Unknown Source)
> at d.F.h(Unknown Source)
> at b.F.a(Unknown Source)
> at coursier.cli.Coursier.main(Unknown Source)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at coursier.Bootstrap.main(Bootstrap.java:428)
> Caused by: java.io.IOException: mark/reset not supported
> at java.base/java.io.InputStream.reset(InputStream.java:474)
> at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
> at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
> at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
> at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 28 more
>  
> However, I can parse the same document without error when I omit --extract:
>  
> % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz
>  
> This issue affects: test-documents.rar, test-documents.tar.Z, test-documents.tbz2, and test-documents.tgz
> But it does not affect test-documents.7z, test-documents.cab, test-documents.ddf, test-documents.dmg, test-documents.tar, or test-documents.zip
>  
>  
> This makes me suspect that it has something to do with extracting files from packages that are embedded in other archives (for example, a tar inside a gzip stream).
>  
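
The "Caused by" section of the trace above shows InputStream.reset() failing inside a detector that was called on an embedded stream. The following is a minimal, self-contained sketch of that failure mode using only the JDK. It is not Tika's code and not necessarily the fix applied for this issue. The class MarkResetSketch and the helpers nonMarkable, peek, and ensureMarkable are hypothetical names introduced for illustration, and it assumes the embedded stream handed to the detector did not support mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MarkResetSketch {

    // Hypothetical stand-in for a stream that does not support mark/reset
    // (assumption: the embedded stream in the report behaved like this).
    static InputStream nonMarkable(byte[] data) {
        final InputStream in = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() throws IOException {
                return in.read();
            }
        };
    }

    // Hypothetical detector step: read up to n leading bytes, then rewind.
    static byte[] peek(InputStream stream, int n) throws IOException {
        stream.mark(n); // a no-op on a plain InputStream
        byte[] buf = new byte[n];
        int total = 0;
        try {
            while (total < n) {
                int r = stream.read(buf, total, n - total);
                if (r < 0) {
                    break;
                }
                total += r;
            }
        } finally {
            // The base InputStream.reset() throws
            // IOException("mark/reset not supported").
            stream.reset();
        }
        return Arrays.copyOf(buf, total);
    }

    // Wrap the stream only when it cannot rewind on its own.
    static InputStream ensureMarkable(InputStream stream) {
        return stream.markSupported() ? stream : new BufferedInputStream(stream);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x1f, (byte) 0x8b, 0x08, 0x00}; // gzip-like header bytes

        try {
            peek(nonMarkable(data), 2);
        } catch (IOException e) {
            System.out.println("unwrapped: " + e.getMessage());
        }

        InputStream safe = ensureMarkable(nonMarkable(data));
        byte[] head = peek(safe, 2);
        System.out.println("wrapped: peeked " + head.length + " bytes");
    }
}
```

Wrapping in BufferedInputStream is one generic way to make a stream rewindable before handing it to code that calls mark() and reset(). Whether the Tika CLI should wrap the stream at this point, or use a Tika-specific wrapper instead, is a design decision for the project and is not settled by this sketch.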



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)