Posted to user@tika.apache.org by Harry Simons <si...@gmail.com> on 2012/03/07 10:07:44 UTC

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Hello,

When converting a bunch of Microsoft Word documents using the command,

     java -jar tika-app-1.1-SNAPSHOT.jar -v -t

, I'm getting the following exception.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
     at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
     at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
     at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
     at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
     at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
     at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
     at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
     at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     ... 4 more




Any idea how to avoid getting this error?

Because these are internal business documents, I may not be able to share them 
with you guys so would greatly appreciate a fix or a workaround.

Noticed that with 'tika-app-1.0.jar', an even greater number of files would fail 
to convert. So, things definitely seem to have improved with version 1.1.

Regards,
/HS



Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Mark Kerzner <ma...@shmsoft.com>.
Hi,

for y'alls information, since I can afford to leave one document
unprocessed (and send it to exceptions), here is what I did

catch (OutOfMemoryError m) {
   // send to exception
}

and the whole processing for eDiscovery <http://freeeed.org/> works fine
for me.
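
For anyone wanting to apply the same skip-and-continue idea to a whole batch, here is a minimal sketch. It does not call Tika itself: `BatchRunner` and `Extractor` are hypothetical names, and the lambda is where a real extraction call would go (for example `new Tika().parseToString(file)`, which is an assumption here and needs Tika on the classpath). It also catches ordinary exceptions, since the failure at the top of this thread is a TikaException wrapping a RuntimeException rather than an OutOfMemoryError.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchRunner {

    /** Hypothetical stand-in for the real extraction call (not a Tika type). */
    public interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Extracted text keyed by path, in processing order. */
    public final Map<String, String> extracted = new LinkedHashMap<>();

    /** Paths that could not be processed and were set aside. */
    public final List<String> failed = new ArrayList<>();

    /** Processes every path; a failure on one file never stops the batch. */
    public void run(List<String> paths, Extractor extractor) {
        for (String path : paths) {
            try {
                extracted.put(path, extractor.extract(path));
            } catch (Exception | OutOfMemoryError e) {
                // "send to exception": remember the file and move on
                failed.add(path);
            }
        }
    }

    public static void main(String[] args) {
        BatchRunner runner = new BatchRunner();
        runner.run(Arrays.asList("good.doc", "failing.doc"), path -> {
            // Simulated parser: one file fails the way the POI trace above does
            if (path.startsWith("failing")) {
                throw new ArrayIndexOutOfBoundsException(487);
            }
            return "text of " + path;
        });
        System.out.println("extracted: " + runner.extracted.keySet());
        System.out.println("failed:    " + runner.failed);
    }
}
```

Catching OutOfMemoryError is generally a last resort, since the JVM may already be in a bad state; it is kept here only because that is the error caught in the message above.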

Sincerely,
Mark

On Thu, Mar 8, 2012 at 8:20 PM, Harry Simons <si...@gmail.com> wrote:

> On 03/08/2012 10:11 PM, Nick Burch wrote:
>
>> On Thu, 8 Mar 2012, Harry Simons wrote:
>>
>>>> If you're able to share the error log, that could be helpful
>>>>
>>> --------------------------------------------
>>> <BFFValidation path="failing.doc" datetime="03/08/12 07:14:27"
>>> result="FAILED">
>>>
>>
>> Any chance that you could raise a new POI bug, post the stacktrace and
>> this validation file? Hopefully someone who knows HWPF better than I do can
>> then take a look and see
>>
>> Nick
>>
> Fyi, here's the bug raised: https://issues.apache.org/bugzilla/show_bug.cgi?id=52863
>
> Appreciate your time and interest in this issue.
>
> Regards,
> /HS
>
>
>

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Harry Simons <si...@gmail.com>.
On 03/08/2012 10:11 PM, Nick Burch wrote:
> On Thu, 8 Mar 2012, Harry Simons wrote:
>>> If you're able to share the error log, that could be helpful
>>>
>> --------------------------------------------
>> <BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
>
> Any chance that you could raise a new POI bug, post the stacktrace and this 
> validation file? Hopefully someone who knows HWPF better than I do can then 
> take a look and see
>
> Nick
>
Fyi, here's the bug raised: https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

Appreciate your time and interest in this issue.

Regards,
/HS



Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Mar 2012, Harry Simons wrote:
>> If you're able to share the error log, that could be helpful
>> 
> --------------------------------------------
> <BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" 
> result="FAILED">

Any chance that you could raise a new POI bug, post the stacktrace and 
this validation file? Hopefully someone who knows HWPF better than I do 
can then take a look and see

Nick

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Harry Simons <si...@gmail.com>.
On 03/08/2012 04:43 PM, Nick Burch wrote:
> On Thu, 8 Mar 2012, Harry Simons wrote:
>> I tried the BFF Validator, and it is indeed failing!
>
> If you're able to share the error log, that could be helpful
>
--------------------------------------------
<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
<ParseStack>
<Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure" 
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
<Info>Built-in type "Docfile": The root storage object of an OLE compound file. 
For more information, see 
http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
</Type>
<Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure" 
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737" 
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
<Info>Built-in type "Stream": Any stream object for OLE compound files. The 
entire file contents for other files.</Info>
</Type>
<Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1" 
msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394" 
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2" 
msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB" 
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type builtinType="USHORT" streamName="WordDocument" bitfield="True" 
bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4" 
streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10" 
hexStreamOffset="0xa" childId="10" hexChildId="0xa">
<Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
</Type>
</ParseStack>
<LastData><![CDATA[
EC A5 01 01 4D 20 09 04  00 00 08 12 BF 00 00 00  ....M...........
00 00 00 30 00 00 00 00  00 08 00 00 66 EF 00 00  ...0........f...
]]></LastData>
</BFFValidation>
--------------------------------------------
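
The ParseStack above stops at a 4-bit field inside the 2-byte value at stream offset 10 (bit offset 84 = 10 * 8 + 4). As a minimal sketch of how to read that field out of the LastData dump, the snippet below parses the first dump row and pulls those bits out. The class and method names are made up for illustration, and the helper assumes the field does not straddle a 2-byte boundary. If I read the MS-DOC FibBase layout correctly these bits are the cQuickSaves field, but treat that name as an assumption.

```java
public class FibBits {

    /** Parses whitespace-separated hex byte pairs, e.g. "EC A5 01", into bytes. */
    public static byte[] parseHex(String dump) {
        String[] tokens = dump.trim().split("\\s+");
        byte[] out = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = (byte) Integer.parseInt(tokens[i], 16);
        }
        return out;
    }

    /** Reads an unsigned 16-bit little-endian value at the given byte offset. */
    public static int readUShortLE(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /**
     * Extracts bitCount bits starting at an absolute bit offset.
     * Assumes the field lies entirely within one 2-byte little-endian value.
     */
    public static int readBits(byte[] data, int bitOffset, int bitCount) {
        int byteOffset = (bitOffset / 16) * 2;  // start of the containing USHORT
        int shift = bitOffset % 16;             // position of the field inside it
        int mask = (1 << bitCount) - 1;
        return (readUShortLE(data, byteOffset) >> shift) & mask;
    }

    public static void main(String[] args) {
        // First row of the LastData dump from the validator output
        byte[] row = parseHex("EC A5 01 01 4D 20 09 04  00 00 08 12 BF 00 00 00");
        System.out.printf("USHORT at offset 10: 0x%04X%n", readUShortLE(row, 10));
        System.out.printf("bits 84..87:         0x%X%n", readBits(row, 84, 4));
    }
}
```

For this dump the 2-byte value at offset 10 is 0x1208 and the 4-bit field comes out as 0, so that is the value the validator was looking at when it reported FAILED.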

>> However, the file got created by MS Word only, and I doubt if it's 
>> 'corrupt'... since both MS Word and LibreOffice can load it fine without any 
>> errors or even warnings of any kind -- everything seems to be normal with 
>> these apps. I can even use LibreOffice 3.5 to convert it to pdf or to a .zip 
>> of xml's.
>
> If you load it in word, and do a save-as, does the new .doc file show the same 
> problem?

No, then it /is/ able to extract the text, but it appends the following to the 
extracted text:

--------------------------------------------
_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]

_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388843352/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388845272/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297360/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388297680/ole-[42, 4D, D6, 09, 00, 00, 00, 00]

_-1388296720/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1388203476/ole-[42, 4D, 66, 09, 00, 00, 00, 00]

_-1382869532/ole-[42, 4D, 36, 0C, 00, 00, 00, 00]

_-1388200596/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1388200916/ole-[42, 4D, BA, 09, 00, 00, 00, 00]

_-1383036196/ole-[42, 4D, 12, 09, 00, 00, 00, 00]

_-1382867932/ole-[42, 4D, 86, 0A, 00, 00, 00, 00]

_-1382868252/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]

_-1380808936/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]
--------------------------------------------

I have 1000s of such documents, and I'm hoping I'll not have to repeat this process for 
each one of them. :-(

I don't know what version of Word the original document got created with, but I 
used MS Word 2007 for the 'Save as' you just suggested.
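
Until the extra output is fixed at the source, one stopgap is to filter those lines out of the extracted text afterwards. A minimal sketch follows; `OleResidueFilter` is a hypothetical name, and the pattern assumes every unwanted line looks exactly like the ones pasted above (an underscore, a signed number, "/ole-", then a bracketed list of hex bytes). The leading bytes 42 4D are ASCII "BM", the usual bitmap signature, so these may be placeholders for embedded images, but that is a guess.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OleResidueFilter {

    // Assumed shape: "_<signed number>/ole-[<hex byte>, <hex byte>, ...]" alone on a line
    private static final Pattern OLE_LINE =
            Pattern.compile("_-?\\d+/ole-\\[[0-9A-Fa-f]{2}(, [0-9A-Fa-f]{2})*\\]");

    /** Returns the text with every line matching the assumed residue shape removed. */
    public static String strip(String text) {
        return Arrays.stream(text.split("\n", -1))
                .filter(line -> !OLE_LINE.matcher(line.trim()).matches())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String extracted = "Last real paragraph.\n"
                + "_-1388201556/ole-[42, 4D, 0E, 0A, 00, 00, 00, 00]\n"
                + "_-1388203796/ole-[42, 4D, 2E, 0A, 00, 00, 00, 00]";
        System.out.println(strip(extracted));
    }
}
```

Only whole lines that match the pattern are dropped, so ordinary text that merely mentions "ole" is left alone.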

>
>> Do you/others still feel it could be addressed by a POI upgrade?
>
> You could try with the Tika 1.1 release candidate, that has the latest POI 
> release in it. You could also try dropping in a recent POI nightly build to 
> see if that helps - Tika will upgrade shortly to POI 3.8 beta 6 once that's out
>
The Tika 1.1 release candidate made no difference. It gave the same behavior with both 
the files:
     original file: same exceptions
     re-saved : same extraneous text appended (pasted above).

>
>> Also, I thought Tika uses POI and would be using POI as a .jar. But looking 
>> in Tika sources, I could find only *POI*.java files but no *POI*.jar or 
>> *poi*.jar file(s).
>
> Depends how you use Tika. The Tika-App inlines all the dependencies, the Tika 
> OSGi Bundle has them individually as jars in the bundle, or Maven will 
> download them for you
>
Seems like the OSGi bundle may be the right packaging choice for me to allow POI 
upgrades independent of Tika. Never used maven or OSGi... is there a link I can 
download the OSGi bundle from and then follow instructions that come with it? I 
can't see it on the Tika site anywhere.


Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Mark Kerzner <ma...@shmsoft.com>.
I have a similar problem (a Java memory exception is what I get). How do I
use the 1.1 RC? Same repo as below?

        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>


On Thu, Mar 8, 2012 at 5:13 AM, Nick Burch <ni...@alfresco.com> wrote:

> On Thu, 8 Mar 2012, Harry Simons wrote:
>
>> I tried the BFF Validator, and it is indeed failing!
>>
>
> If you're able to share the error log, that could be helpful
>
>  However, the file got created by MS Word only, and I doubt if it's
>> 'corrupt'... since both MS Word and LibreOffice can load it fine without
>> any errors or even warnings of any kind -- everything seems to be normal
>> with these apps. I can even use LibreOffice 3.5 to convert it to pdf or to
>> a .zip of xml's.
>>
>
> If you load it in word, and do a save-as, does the new .doc file show the
> same problem?
>
>  Do you/others still feel it could be addressed by a POI upgrade?
>>
>
> You could try with the Tika 1.1 release candidate, that has the latest POI
> release in it. You could also try dropping in a recent POI nightly build to
> see if that helps - Tika will upgrade shortly to POI 3.8 beta 6 once that's
> out
>
>
>  Also, I thought Tika uses POI and would be using POI as a .jar. But
>> looking in Tika sources, I could find only *POI*.java files but no
>> *POI*.jar or *poi*.jar file(s).
>>
>
> Depends how you use Tika. The Tika-App inlines all the dependencies, the
> Tika OSGi Bundle has them individually as jars in the bundle, or Maven will
> download them for you
>
> Nick
>



-- 


Mark Kerzner, CEO, SHMsoft <http://shmsoft.com/>

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Mar 2012, Harry Simons wrote:
> I tried the BFF Validator, and it is indeed failing!

If you're able to share the error log, that could be helpful

> However, the file got created by MS Word only, and I doubt if it's 
> 'corrupt'... since both MS Word and LibreOffice can load it fine without any 
> errors or even warnings of any kind -- everything seems to be normal with 
> these apps. I can even use LibreOffice 3.5 to convert it to pdf or to a .zip 
> of xml's.

If you load it in word, and do a save-as, does the new .doc file show the 
same problem?

> Do you/others still feel it could be addressed by a POI upgrade?

You could try with the Tika 1.1 release candidate, that has the latest POI 
release in it. You could also try dropping in a recent POI nightly build 
to see if that helps - Tika will upgrade shortly to POI 3.8 beta 6 once 
that's out


> Also, I thought Tika uses POI and would be using POI as a .jar. But 
> looking in Tika sources, I could find only *POI*.java files but no 
> *POI*.jar or *poi*.jar file(s).

Depends how you use Tika. The Tika-App inlines all the dependencies, the 
Tika OSGi Bundle has them individually as jars in the bundle, or Maven 
will download them for you

Nick

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Harry Simons <si...@gmail.com>.
Nick, thanks for your response.

I tried the BFF Validator, and it is indeed failing!

However, the file got created by MS Word only, and I doubt if it's 'corrupt'... 
since both MS Word and LibreOffice can load it fine without any errors or even 
warnings of any kind -- everything seems to be normal with these apps. I can 
even use LibreOffice 3.5 to convert it to pdf or to a .zip of xml's.

 > This looks like a POI bug
Do you/others still feel it could be addressed by a POI upgrade?

Also, I thought Tika uses POI and would be using POI as a .jar. But looking in 
Tika sources, I could find only *POI*.java files but no *POI*.jar or *poi*.jar 
file(s).

/HS

On 03/07/2012 06:08 PM, Nick Burch wrote:
> On Wed, 7 Mar 2012, Harry Simons wrote:
>> When converting a bunch of Microsoft Word documents using the command,
>>
>>    java -jar tika-app-1.1-SNAPSHOT.jar -v -t
>>
>> , I'm getting the following exception.
>>
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
>>    at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
>>    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
>
> This looks like a POI bug
>
>
>> Because these are internal business documents, I may not be able to share 
>> them with you guys so would greatly appreciate a fix or a workaround.
>
> That's going to make fixing it much trickier. You'll need to raise a POI bug, 
> and be willing to do lots of investigating
>
> It may also be worth running the Binary File Format Validator 
> <http://poi.apache.org/faq.html#faq-N10109> against the file, to check it's 
> valid and not corrupted
>
> Nick
>

Re: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 7 Mar 2012, Harry Simons wrote:
> When converting a bunch of Microsoft Word documents using the command,
>
>    java -jar tika-app-1.1-SNAPSHOT.jar -v -t
>
> , I'm getting the following exception.
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
>    at 
> org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
>    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)

This looks like a POI bug


> Because these are internal business documents, I may not be able to share 
> them with you guys so would greatly appreciate a fix or a workaround.

That's going to make fixing it much trickier. You'll need to raise a POI 
bug, and be willing to do lots of investigating

It may also be worth running the Binary File Format Validator 
<http://poi.apache.org/faq.html#faq-N10109> against the file, to check 
it's valid and not corrupted

Nick