Posted to dev@tika.apache.org by Tom Gross <it...@gmail.com> on 2011/06/23 19:09:16 UTC

Failing word parse with tika 0.9

Hi

I have a Word document:

maie.doc: CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code 
page: 1252, Title: Modul: Unternehmungsf\177hrung 5, Author: APO, 
Template: Normal.dot, Last Saved By: APO, Revision Number: 8, Name of 
Creating Application: Microsoft Office Word, Last Printed: Sun Apr 26 
23:38:00 2009, Create Time/Date: Sun Apr 26 23:38:00 2009, Last Saved 
Time/Date: Wed Apr 29 08:45:00 2009, Number of Pages: 1, Number of 
Words: 533, Number of Characters: 3364, Security: 0

which tika 0.9 can't parse. It fails with:

java -jar tika-app-0.9.jar ~/Download/maie.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@ec0a9f9
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
Caused by: java.lang.NullPointerException
         at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
         at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
         at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:828)
         at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:881)
         at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:127)
         at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
         ... 5 more

I can provide the document privately if someone is willing to dig 
into this. My Java version is:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (suse-1.2.1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

Thanks
-Tom


-- 
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C

Tom Gross
email..........tom@toms-projekte.de
skype.....................tom_gross
web.........http://toms-projekte.de
blog...http://blog.toms-projekte.de


Re: Failing word parse with tika 0.9

Posted by Tom Gross <it...@gmail.com>.
Upgrading to POI 3.8 beta 3 fixed the issue.

Thanks Nick!

On 06/23/2011 07:24 PM, Nick Burch wrote:
> On Thu, 23 Jun 2011, Tom Gross wrote:
>> which tika 0.9 can't parse. It fails with:
>>
>> Caused by: java.lang.NullPointerException
>> at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
>> at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
>
> This is an Apache POI bug. Can you try with a newer copy of Apache POI?
> (If you build Tika from svn it'll pull down a newer one, 3.8 beta 3)
>
> If that doesn't fix it, you'll need to file a bug report with POI, but
> you'll need to try with the latest version first!
>
> Nick
>




Re: Failing word parse with tika 0.9

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 23 Jun 2011, Tom Gross wrote:
> which tika 0.9 can't parse. It fails with:
>
> Caused by: java.lang.NullPointerException
>        at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
>        at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)

This is an Apache POI bug. Can you try with a newer copy of Apache POI? 
(If you build Tika from svn it'll pull down a newer one, 3.8 beta 3)

If that doesn't fix it, you'll need to file a bug report with POI, but 
you'll need to try with the latest version first!

Nick
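
The diagnosis above can be read off the trace: the innermost "Caused by" entry is the NullPointerException, and its top frame sits in org.apache.poi rather than org.apache.tika. A minimal Java sketch of the same check follows, using only the standard library. The class name RootCause and its two helper methods are hypothetical (they are not part of Tika or POI), and the exception in main is simulated rather than produced by a real parse.

```java
import java.util.Objects;

class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = Objects.requireNonNull(t);
        // Stop when there is no further cause, or when a cause points at itself.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Class name of the frame where the root cause was thrown, or "" if no frames exist. */
    static String originClass(Throwable t) {
        StackTraceElement[] frames = rootCause(t).getStackTrace();
        return frames.length == 0 ? "" : frames[0].getClassName();
    }

    public static void main(String[] args) {
        // Simulated failure (assumption): an NPE whose top frame mimics the one in the trace.
        NullPointerException npe = new NullPointerException();
        npe.setStackTrace(new StackTraceElement[] {
            new StackTraceElement(
                "org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor",
                "uncompressPAP", "ParagraphSprmUncompressor.java", 47)
        });
        // Wrapper standing in for the outer exception reported by the parser.
        RuntimeException wrapper =
            new RuntimeException("Unexpected RuntimeException from parser", npe);

        System.out.println(rootCause(wrapper).getClass().getName());
        System.out.println(originClass(wrapper));
    }
}
```

If the origin class starts with org.apache.poi, the report most likely belongs with the POI project, after retrying with the latest POI release as suggested.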