Posted to dev@tika.apache.org by "Nick Harmer (Jira)" <ji...@apache.org> on 2021/03/12 16:29:00 UTC

[jira] [Created] (TIKA-3316) Illegal IOException processing XPS files

Nick Harmer created TIKA-3316:
---------------------------------

             Summary: Illegal IOException processing XPS files
                 Key: TIKA-3316
                 URL: https://issues.apache.org/jira/browse/TIKA-3316
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.24
            Reporter: Nick Harmer
         Attachments: test1.xps, test2.xps, test3.xps, test4.xps

I have a number of (relatively simple) XPS documents which Tika fails to process.  The following exception appears:
{code:java}
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4149c063
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
        at com.mcms.Main.parseFile(Main.java:88)
        at com.mcms.Main.main(Main.java:59)
Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry Documents/1/Metadata/Page1_Thumbnail.JPG
        at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:477)
        at java.base/java.io.FilterInputStream.read(Unknown Source)
        at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:80)
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:182)
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
        at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        ... 5 more
{code}
 

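It appears that the generator for these files (the XPS printer driver, printing from Notepad) adds a per-page thumbnail image, Documents/1/Metadata/Page1_Thumbnail.JPG, and that this ZIP entry is written with a data descriptor, which the streaming ZIP reader underneath OOXMLParser rejects.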
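
For anyone triaging similar files, the following is a minimal sketch of how one might check whether an XPS package contains such entries. It rests on an assumption drawn from the exception message rather than from the attached files: that the failure is triggered by entries which set the data-descriptor flag (general purpose bit 3) while using the STORED method. The class name ZipLocalHeaderCheck and its methods are hypothetical and are not part of Tika, POI or Commons Compress. The scan is a heuristic that searches the raw bytes for local file header signatures, so it could in principle report a false match inside compressed data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; uses only the JDK.
final class ZipLocalHeaderCheck {
    private static final int FLAG_DATA_DESCRIPTOR = 1 << 3; // general purpose bit 3
    private static final int METHOD_STORED = 0;             // compression method 0
    private static final int LOCAL_HEADER_SIZE = 30;        // fixed part of a local file header

    /** Reads an unsigned little-endian 16-bit value at the given offset. */
    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    /**
     * Returns the names of entries whose local file header sets the
     * data-descriptor flag and uses the STORED method.
     */
    static List<String> storedEntriesWithDataDescriptor(byte[] zip) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i + LOCAL_HEADER_SIZE <= zip.length; i++) {
            // Local file header signature "PK\3\4" (0x04034b50, little-endian).
            if (u16(zip, i) != 0x4b50 || u16(zip, i + 2) != 0x0403) {
                continue;
            }
            int flags = u16(zip, i + 6);
            int method = u16(zip, i + 8);
            int nameLength = u16(zip, i + 26);
            if (i + LOCAL_HEADER_SIZE + nameLength > zip.length) {
                continue; // truncated or false match
            }
            if ((flags & FLAG_DATA_DESCRIPTOR) != 0 && method == METHOD_STORED) {
                names.add(new String(zip, i + LOCAL_HEADER_SIZE, nameLength,
                        StandardCharsets.UTF_8));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: ZipLocalHeaderCheck <file.xps>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (String name : storedEntriesWithDataDescriptor(data)) {
            System.out.println("STORED entry with data descriptor: " + name);
        }
    }
}
```

If the assumption holds, running this against one of the attached files should list the thumbnail entry named in the stack trace. An empty result would suggest the cause lies elsewhere.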

