Posted to dev@tika.apache.org by Tom Gross <it...@gmail.com> on 2011/06/23 19:09:16 UTC
Failing word parse with tika 0.9
Hi
I have a Word Document
maie.doc: CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code
page: 1252, Title: Modul: Unternehmungsf\177hrung 5, Author: APO,
Template: Normal.dot, Last Saved By: APO, Revision Number: 8, Name of
Creating Application: Microsoft Office Word, Last Printed: Sun Apr 26
23:38:00 2009, Create Time/Date: Sun Apr 26 23:38:00 2009, Last Saved
Time/Date: Wed Apr 29 08:45:00 2009, Number of Pages: 1, Number of
Words: 533, Number of Characters: 3364, Security: 0
which tika 0.9 can't parse. It fails with:
java -jar tika-app-0.9.jar ~/Download/maie.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@ec0a9f9
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
Caused by: java.lang.NullPointerException
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
    at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
    at org.apache.poi.hwpf.usermodel.Range.getParagraph(Range.java:828)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:881)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:127)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:182)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 5 more
I can send you the document privately if someone is willing to dig
into this. My Java version is:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (suse-1.2.1-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
Thanks
-Tom
--
Author of the book "Plone 3 Multimedia" - http://amzn.to/dtrp0C
Tom Gross
email..........tom@toms-projekte.de
skype.....................tom_gross
web.........http://toms-projekte.de
blog...http://blog.toms-projekte.de
Re: Failing word parse with tika 0.9
Posted by Tom Gross <it...@gmail.com>.
Upgrading to poi 3.8beta3 fixed the issue.
Thanks Nick!
On 06/23/2011 07:24 PM, Nick Burch wrote:
> On Thu, 23 Jun 2011, Tom Gross wrote:
>> which tika 0.9 can't parse. It fails with:
>>
>> Caused by: java.lang.NullPointerException
>>     at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
>>     at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
>
> This is an Apache POI bug. Can you try with a newer copy of Apache POI?
> (If you build Tika from svn it'll pull down a newer one, 3.8 beta 3)
>
> If that doesn't fix it, you'll need to file a bug report with POI, but
> you'll need to try with the latest version first!
>
> Nick
>
Re: Failing word parse with tika 0.9
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 23 Jun 2011, Tom Gross wrote:
> which tika 0.9 can't parse. It fails with:
>
> Caused by: java.lang.NullPointerException
>     at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:47)
>     at org.apache.poi.hwpf.model.PAPX.getParagraphProperties(PAPX.java:136)
This is an Apache POI bug. Can you try with a newer copy of Apache POI?
(If you build Tika from svn it'll pull down a newer one, 3.8 beta 3)
If that doesn't fix it, you'll need to file a bug report with POI, but
you'll need to try with the latest version first!
Nick
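
Editor's note: the diagnosis in this thread comes from reading the innermost "Caused by" entry of the trace. Its top frame is in org.apache.poi, not org.apache.tika, which points at POI as the place where the exception was thrown. The same check can be sketched in plain Java using only the standard library. The class name RootCauseOrigin and its methods are hypothetical helpers made up for this illustration; they are not part of Tika or POI, and main() rebuilds the reported frames by hand rather than parsing a real document.

```java
public class RootCauseOrigin {

    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Class name of the top frame of the root cause, or "" if it has no frames. */
    static String throwingClass(Throwable t) {
        StackTraceElement[] frames = rootCause(t).getStackTrace();
        return frames.length == 0 ? "" : frames[0].getClassName();
    }

    /** True if the root cause was thrown from a class under the given package. */
    static boolean thrownInside(Throwable t, String packagePrefix) {
        return throwingClass(t).startsWith(packagePrefix + ".");
    }

    public static void main(String[] args) {
        // Hand-made copy of the first two "Caused by" frames from the report.
        NullPointerException npe = new NullPointerException();
        npe.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor",
                    "uncompressPAP", "ParagraphSprmUncompressor.java", 47),
            new StackTraceElement("org.apache.poi.hwpf.model.PAPX",
                    "getParagraphProperties", "PAPX.java", 136)
        });
        // Stand-in for the wrapping exception raised by the calling layer.
        Exception wrapper = new RuntimeException("Unexpected RuntimeException", npe);

        System.out.println(throwingClass(wrapper));
        System.out.println(thrownInside(wrapper, "org.apache.poi"));
    }
}
```

In real use the Throwable would be whatever the parse call threw. A true result for "org.apache.poi" only says where the exception surfaced, so retrying with the newest POI release, as suggested above, remains the first step before filing a bug.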