Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/03/12 14:03:13 UTC

[jira] [Commented] (TIKA-1092) Parsing of old Word file causes a TikaException

    [ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600006#comment-13600006 ] 

Nick Burch commented on TIKA-1092:
----------------------------------

I'm not sure that your problem file is actually a Word document. The exception you're seeing is triggered by POI trying to open the file and discovering that it's not actually an OLE2 document. POI can't handle very old Office documents (roughly pre-95, though the cutoff varies between formats), but it can at least open their outer OLE2 container.

Without the sample file I can't tell what your file actually is, but my best guess is that someone has renamed it to .doc when it isn't anything like that.
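
If it helps to check a suspect file by hand, here is a minimal sketch of the same kind of test. It is not POI's or Tika's actual code: the class and method names are made up for illustration, and the only value taken from this issue is the expected signature printed in the stack trace. It reads the first 8 bytes of the file as a little-endian long and compares them with that signature.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class Ole2SignatureCheck {

    // Expected value as reported in the exception message:
    // "expected 0xE11AB1A1E011CFD0"
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Reads the first 8 bytes of the stream as a little-endian long.
    // Throws EOFException if the stream holds fewer than 8 bytes.
    static long readSignature(InputStream in) throws IOException {
        byte[] header = new byte[8];
        new DataInputStream(in).readFully(header);
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Returns true only if the stream starts with the OLE2 signature.
    // Streams shorter than 8 bytes are reported as "not OLE2".
    static boolean isOle2(InputStream in) throws IOException {
        try {
            return readSignature(in) == OLE2_SIGNATURE;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2SignatureCheck <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            long signature = readSignature(in);
            System.out.printf("read 0x%016X, OLE2 container: %b%n",
                    signature, signature == OLE2_SIGNATURE);
        }
    }
}
```

A file for which this reports false will hit the same "Invalid header signature" failure when POI tries to open it, whatever its extension says.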
                
> Parsing of old Word file causes a TikaException
> -----------------------------------------------
>
>                 Key: TIKA-1092
>                 URL: https://issues.apache.org/jira/browse/TIKA-1092
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>              Labels: office, parse, word-exception
>
> I found an issue with the parse method of org.apache.tika.parser.microsoft.OfficeParser. This parser throws a TikaException when it tries to parse very old Microsoft Word files.
> I think this issue is not a priority, because the files that cause the exception belong to an obsolete format/structure that even new Microsoft Office versions don't support, but it's important to know that something can go wrong with these outdated types.
> Here are two links about the old types (from the Microsoft support perspective):
> http://support.microsoft.com/?kbid=922850
> http://support.microsoft.com/kb/922849/it
> For example, the TikaException message is below:
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
> 	at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
> 	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
> 	at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 5 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira