Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/26 13:48:24 UTC

[jira] [Commented] (TIKA-2251) TIKA-198 due to java.util.zip.ZipException: invalid literal/lengths set

    [ https://issues.apache.org/jira/browse/TIKA-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839689#comment-15839689 ] 

Tim Allison commented on TIKA-2251:
-----------------------------------

Thank you for opening this issue.

{{header1.xml}} appears to be genuinely corrupt: WinZip can't decompress it, and MS Word complains that something is wrong with the file and offers to fix it.

The new experimental SAX docx parser throws the same exception.

Would you prefer that we catch and log this exception, then continue extraction with null headers?
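
For discussion, here is a rough sketch of the catch-and-log idea using only {{java.util.zip}}. It is not the Tika or POI code path: {{LenientPartReader}}, {{readOptionalPart}} and {{deflate}} are hypothetical names invented for illustration, and the four garbage bytes merely stand in for a corrupt {{header1.xml}}. A corrupt optional part comes back as null and the caller carries on with the rest of the document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical helper, not part of Tika or POI.
public class LenientPartReader {
    private static final Logger LOG = Logger.getLogger(LenientPartReader.class.getName());

    /**
     * Inflates an optional part. Returns null (after logging) when the
     * compressed data is invalid, so the caller can continue without it.
     */
    public static byte[] readOptionalPart(String partName, byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (ZipException e) {
            // Only ZipException is swallowed; other IOExceptions still propagate.
            LOG.warning("Skipping corrupt part " + partName + ": " + e.getMessage());
            return null;
        }
    }

    /** Compresses text so the demo has one valid part to read. */
    public static byte[] deflate(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Garbage bytes stand in for the corrupt header part.
        byte[] header = readOptionalPart("header1.xml", new byte[] {1, 2, 3, 4});
        byte[] body = readOptionalPart("document.xml", deflate("body text"));
        System.out.println("header: " + (header == null ? "skipped" : "read"));
        System.out.println("body: " + new String(body, StandardCharsets.UTF_8));
    }
}
```

The sketch deliberately catches only {{ZipException}}, so a truncated stream or an unrelated I/O failure would still surface to the caller. Whether the real fix should be that narrow is an open question.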



> TIKA-198 due to java.util.zip.ZipException: invalid literal/lengths set
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2251
>                 URL: https://issues.apache.org/jira/browse/TIKA-2251
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Jorge Spinsanti
>         Attachments: ZipException.docx
>
>
> I got an exception while extracting text from a file. See the stack trace below and the attached file to reproduce:
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@7f54cc49
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
> 	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
> 	at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
> 	at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:207)
> 	at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
> 	at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
> 	at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> 	at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> 	at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> 	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
> 	at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:137)
> 	at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:115)
> 	at org.openxmlformats.schemas.wordprocessingml.x2006.main.HdrDocument$Factory.parse(Unknown Source)
> 	at org.apache.poi.xwpf.usermodel.XWPFHeader.onDocumentRead(XWPFHeader.java:108)
> 	at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:212)
> 	at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
> 	at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
> 	at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
> 	at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
> 	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	... 23 more
> {code}


