Posted to dev@poi.apache.org by bu...@apache.org on 2015/04/21 17:18:57 UTC

[Bug 57843] New: RuntimeException on extracting text from Word 97-2004 Document

https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

            Bug ID: 57843
           Summary: RuntimeException on extracting text from Word 97-2004
                    Document
           Product: POI
           Version: 3.12-dev
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
          Assignee: dev@poi.apache.org
          Reporter: jeremy.merrill@nytimes.com

Created attachment 32674
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=32674&action=edit
failing document

I am trying to parse this document via Tika. It appears to be a fairly ordinary
Word 97-era .doc. It opens correctly in Word for Mac 2011.

It's attached. The document is already publicly posted and I grant any rights I
have in the document to ASF; I should note that it's part of a publicly-posted
dump of emails sent to/from former Florida Gov. Jeb Bush, so I don't hold
copyright over it.

This is the POI version of https://issues.apache.org/jira/browse/TIKA-1608

Stacktrace looks like this:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

Sergei Malafeev <se...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sergeymalafeev@gmail.com



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|greg.woolsey@gmail.com      |



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

--- Comment #1 from Javen O'Neal <on...@apache.org> ---
POI detected this as a Word 95 or older file, which requires HWPFOldDocument to
read it.

The file identifies itself as a Microsoft Word 6.0 Document, the file format
used by Word 6.0, which was released in 1993. [1]

[1] https://en.wikipedia.org/wiki/Microsoft_Word#Release_history

I got the same error as you in the latest version of POI, 3.16 trunk. I added
this failing unit test in r1761873.

> java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>   at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:107)
>   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:45)
>   at org.apache.poi.hwpf.usermodel.TestBugs.test57843(TestBugs.java:911)
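
The trace bottoms out in System.arraycopy, which throws when the requested
range runs past the end of the source array. The following is a minimal sketch
of that failure mode next to a bounds-checked variant, in plain Java. The class
and method names (GrpprlCopySketch, copyUnchecked, copyChecked), the 512-byte
buffer, and the offsets are illustrative assumptions; this is not the code in
PAPFormattedDiskPage.getGrpprl.

```java
final class GrpprlCopySketch {

    /** Unchecked copy: same shape as a raw System.arraycopy call. */
    static byte[] copyUnchecked(byte[] page, int offset, int size) {
        byte[] out = new byte[size];
        // Throws an IndexOutOfBoundsException (ArrayIndexOutOfBoundsException
        // on HotSpot) when offset + size runs past page.length.
        System.arraycopy(page, offset, out, 0, size);
        return out;
    }

    /** Checked copy: validates the range first and reports what was wrong. */
    static byte[] copyChecked(byte[] page, int offset, int size) {
        // Written as a subtraction so the check cannot overflow.
        if (offset < 0 || size < 0 || offset > page.length - size) {
            throw new IllegalArgumentException(
                "offset=" + offset + ", size=" + size
                    + " exceeds page length " + page.length);
        }
        byte[] out = new byte[size];
        System.arraycopy(page, offset, out, 0, size);
        return out;
    }

    public static void main(String[] args) {
        byte[] page = new byte[512]; // hypothetical buffer size for the demo

        try {
            copyUnchecked(page, 500, 40); // 500 + 40 > 512
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked: " + e.getClass().getSimpleName());
        }

        try {
            copyChecked(page, 500, 40);
        } catch (IllegalArgumentException e) {
            System.out.println("checked: " + e.getMessage());
        }
    }
}
```

The checked variant still fails on a bad size, but the message names the
offending offset and size, which is more useful when the values come from a
possibly corrupt or misdetected file.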



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

GW <gr...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |greg.woolsey@gmail.com



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

--- Comment #2 from Dominik Stadler <do...@gmx.at> ---
There is a failing test for this at
org.apache.poi.hwpf.usermodel.TestBugs.test57603SevenRowTable, which was added
via r1761873.



[Bug 57843] RuntimeException on extracting text from Word 97-2004 Document

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

Andreas Beeker <ki...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #3 from Andreas Beeker <ki...@apache.org> ---
Accidentally fixed via r1876641 while refactoring the System.arraycopy
calls.
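
The comment does not say what the refactoring changed, so the following is
only a sketch of one way that replacing a raw System.arraycopy call can alter
behaviour. It is an assumption for illustration, not a description of
r1876641. java.util.Arrays.copyOfRange accepts an end index beyond the source
length and zero-pads the result, where System.arraycopy would throw. The class
and method names (CopyOfRangeSketch, copyPadded) are hypothetical.

```java
import java.util.Arrays;

final class CopyOfRangeSketch {

    /** Copies size bytes starting at offset, zero-padding past the end of src. */
    static byte[] copyPadded(byte[] src, int offset, int size) {
        // Arrays.copyOfRange allows the end index to exceed src.length and
        // fills the remainder with zeros. It still throws if offset is
        // negative or greater than src.length.
        return Arrays.copyOfRange(src, offset, offset + size);
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4};
        // Asks for 4 bytes from offset 2, although only 2 bytes remain.
        System.out.println(Arrays.toString(copyPadded(src, 2, 4)));
    }
}
```

Whether silent zero-padding is acceptable depends on what the caller does with
the bytes; for a parser it trades an exception for possibly meaningless
trailing data.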
